3 Common Pitfalls When Embracing the Data Lake Architecture with AWS Athena
Data Lakes have emerged as the standard for big data management, with data virtualization becoming the predominant approach for versatile large-scale analytics.
AWS Athena Leverages Presto to Deliver Modern Data Virtualization
• Saves the overhead of authoring and maintaining multiple ETL pipelines. Not only does this save time and money now, it also leaves you more flexible for future changes, making it much easier to respond to evolving business requirements.
• The data is as fresh as possible.
• All the data is accessible in its most granular form. Most ETLs cause a loss of granularity, keeping only the “important” parts of the data, which often eliminates the ability to drill down.
Due to these unique advantages, Presto has quickly become a tool of choice for many data-driven companies.
Serve Interactive Queries Without Compromising on Agility
As Athena administrators know, Athena works exceptionally well out of the box until users run into performance issues. As a result, when an Athena deployment gains adoption in an organization, users hit roadblocks trying to productionize the system. For example, some queries incur expensive, time-consuming scans, so users can’t reliably power real-time dashboards. Users also struggle to get predictable query times when issuing complicated joins. Consequently, many data teams revert to a data warehouse, which instantly voids all of these benefits: data engineering teams must now deal with data migration, consistency, and multiple permission models, while users struggle to find data across multiple data catalogs. These common performance issues have limited the spectrum of use cases that can run on Athena, which often ends up as a platform for ad hoc queries only, not business-critical analytics.
Visibility & Control
Even though core Presto has powerful tools for optimization, a zero-devops solution such as Athena doesn’t include any tooling for analyzing performance issues. Admins lack continuous monitoring and deep visibility into how workloads perform, how resources are used on an hourly or weekly basis, who the heavy spenders are, what the “hottest” data is, and so on. Athena administrators often find they need domain expertise to understand the data users are querying and to optimize their workflows. Administrators struggle to help users because there is no good way to analyze what Athena is doing under the covers. And since data sets and use cases change quickly, any hard-earned gains from manual optimization quickly go out the window.
Full Scans Often Result in Spiraling & Unpredictable Costs
Athena is priced based on the amount of data scanned, a model that seems simple and easy to understand. But where it shines on simplicity, it fails on predictability. Data teams often struggle to estimate the level of performance users should expect, and lack the tools to estimate cost and ensure it matches the allocated budget. Unlike other popular data platforms, Athena doesn’t include query acceleration options that let users and data platform teams consistently meet performance and concurrency requirements. The “serverless” nature of Athena brings tremendous benefits in ease of use, but when it comes to managing budgets and business requirements, data teams are forced to manually optimize data even before it hits the platform, which again defeats the goal of a modern data lake architecture.
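To make the pricing model concrete, here is a minimal cost-estimation sketch. It assumes the published on-demand rate of $5 per TB scanned and Athena's 10 MB per-query minimum; both figures can change, so verify them against current AWS pricing before budgeting with this.

```python
PRICE_PER_TB = 5.00  # USD per TB scanned (assumed on-demand rate; verify with AWS)

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost of one Athena query from bytes scanned.

    Athena rounds each query up to a 10 MB minimum scan.
    """
    MIN_BYTES = 10 * 1024**2                      # 10 MB per-query minimum
    tb = max(bytes_scanned, MIN_BYTES) / 1024**4  # bytes -> TB
    return tb * PRICE_PER_TB

# A full scan of a 2 TB table vs. a partition-pruned 40 GB scan:
full_scan = athena_query_cost(2 * 1024**4)   # 10.0 USD
pruned    = athena_query_cost(40 * 1024**3)  # 0.1953125 USD
```

The gap between the two numbers is the whole story: without pruning, partitioning, or indexing, every query pays for a full scan, and monthly cost scales with query volume rather than with the data a query actually needs.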
Varada’s Fresh Approach to Data Virtualization Puts Indexing at the Front
Varada’s dynamic indexing technology eliminates the need for full scans and can accelerate queries automatically without any overhead to query processing or any background data maintenance.
Users see performance benefits when filtering, joining and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within an SQL statement. Indexes are used for point lookups, range queries and string matching of data in nanoblocks.
Varada automatically detects and uses indexes to accelerate JOINs using the index of the key column.
Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well.
Varada’s indexing works transparently for users, and indexes are managed automatically by Varada’s proprietary cost-based optimizer extensions. Varada efficiently indexes data directly from the data lake across any column, so that every query is optimized automatically.
Varada indexes adapt to changes in data over time by splitting each column into small chunks, called nanoblocks.
To ensure fast performance for every query, Varada dynamically selects from a set of indexing algorithms and parameters that adapt and evolve as the data changes, ensuring a best-fit index for every nanoblock.
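The idea of choosing a different index per nanoblock can be illustrated with a short sketch. This is not Varada's actual implementation (which is proprietary); the chunk size, cardinality threshold, and index names below are all hypothetical, chosen only to show how an index type can be picked per chunk based on the data it holds.

```python
NANOBLOCK_ROWS = 8192  # hypothetical chunk size

def pick_index(values: list) -> str:
    """Pick a best-fit index type for one nanoblock of column values."""
    distinct = len(set(values))
    if distinct <= 32:
        return "bitmap"  # low cardinality: one bitmap per distinct value
    if all(isinstance(v, (int, float)) for v in values):
        return "range"   # numeric data: supports point lookups and range queries
    return "text"        # strings: token-based index for string matching

def index_column(column: list) -> list:
    """Split a column into nanoblocks and index each one independently."""
    return [
        pick_index(column[i:i + NANOBLOCK_ROWS])
        for i in range(0, len(column), NANOBLOCK_ROWS)
    ]
```

Because each nanoblock is indexed independently, a column whose character changes over time (say, from a handful of status codes to high-cardinality IDs) ends up with different index types in different chunks, which is what lets the index stay a good fit as the data evolves.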
The Ultimate Data Democratization Solution:
Serving a Wide Range of Queries and Use Cases
Whether you’re considering more cost-effective architectures for a cloud data lake or have already gotten started with Presto and Athena, you’ll find a lot of success pairing an AWS Athena-based solution with Varada.
Athena brings the reality of a no-devops query engine to AWS based data lakes, enabling true data virtualization without costly data duplication and brittle data movement.
For large-scale use cases, Varada enables data architects to seamlessly accelerate and optimize workloads to meet specific performance and cost requirements with zero data-ops and effective resource utilization.
How to Avoid Trading DevOps Savings for DataOps Costs
Why Visibility is Critical for Success in Enterprise Data Analytics
As a data team, you’re drowning in demand from users requesting analytics access to your corporate data lake, and there’s good reason to welcome these requests. You can handle them in one of two ways: either move subsets of the data out to externally managed systems, or provide a query engine that accesses the data directly in the data lake.
There's a New Standard for Data Virtualization
With dynamic analysis and adaptive indexing, data architects can seamlessly accelerate and optimize workloads, resulting in optimal control over performance and cost.