AWS: Athena

  • 21 Sep, 2021

AWS Athena

AWS Athena is Amazon's serverless, interactive analytics service, focused on querying data stored in S3 buckets using SQL syntax.

Optimization

The AWS Athena billing model is calculated primarily from the volume of data scanned by each query. As such, it behooves us to reduce the data scanned, and there are a few options for doing so. In addition to the realized cost savings, these optimization techniques also improve query response times.

Partitioning

By partitioning your data, you can decrease the amount of data scanned by each query, thus improving performance and reducing cost. AWS provides a decent explanation of partitioning data in Athena, so I won't reinvent the wheel.
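
As a minimal sketch (the table, database, bucket name, and layout here are hypothetical), a table partitioned on a date string only scans the S3 prefixes matching the predicate, provided the partition column appears in the WHERE clause:

CREATE EXTERNAL TABLE `app_logs`(
  `message` string)
PARTITIONED BY (
  `datestamp` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://example-log-bucket/app_logs/'

-- Without partition projection, new partitions must be registered first.
-- MSCK REPAIR TABLE assumes Hive-style prefixes such as datestamp=2021-09-01/;
-- other layouts require ALTER TABLE ADD PARTITION instead.
MSCK REPAIR TABLE app_logs

-- Only objects under the matching datestamp prefixes are scanned.
SELECT message
FROM "db"."app_logs"
WHERE datestamp >= '2021-09-01'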

However, I will highlight the ability to perform partition projection. With partition projection, partition values and locations are calculated from configuration rather than read from a repository (like the AWS Glue Data Catalog). For heavily partitioned tables this provides a performance increase, as Athena avoids having to make a call to the AWS Glue Data Catalog. Partition management is also simplified, since Athena builds partitions automatically from the information provided in the partition projection configuration.
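
As a rough sketch of that configuration (placeholder names again; the WAF table later in this post uses the same mechanism), the projection is declared entirely in TBLPROPERTIES, so no partition registration or Glue Data Catalog lookups are needed:

CREATE EXTERNAL TABLE `app_logs`(
  `message` string)
PARTITIONED BY (
  `datestamp` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://example-log-bucket/app_logs/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.datestamp.type'='date',
  'projection.datestamp.format'='yyyy/MM/dd',
  'projection.datestamp.range'='2021/01/01,NOW',
  'projection.datestamp.interval'='1',
  'projection.datestamp.interval.unit'='DAYS',
  'storage.location.template'='s3://example-log-bucket/app_logs/${datestamp}')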

Compression

Secondly, AWS Athena supports reading several compressed data formats. This pairs well with the compression capabilities built into AWS Kinesis Data Firehose, which can deliver logs to S3 already compressed (e.g., GZIP).
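
As a sketch of storing data compressed from within Athena itself (the target table name, output location, and compression codec here are assumptions, not part of the original setup), a CTAS statement can rewrite existing data into compressed Parquet:

-- Rewrites the (hypothetical) app_logs table as Snappy-compressed Parquet;
-- subsequent queries against the new table scan far fewer bytes.
CREATE TABLE app_logs_compressed
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://example-log-bucket/app_logs_compressed/'
) AS
SELECT *
FROM "db"."app_logs"

Objects that Firehose delivers gzip-compressed (with a .gz suffix) should also be readable by the JSON SerDe without any change to the table definition.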

Use Cases

Given the relatively cheap storage cost of S3 (especially when compared to options such as EBS), S3 presents itself as an ideal solution for mid- to long-term log retention, or even scenarios where aggregated log volume renders traditional hot storage options cost prohibitive.

WAF Logs

AWS provides example guidance on constructing Athena tables for use with AWS WAF logs. This is sufficient for most use cases; however, there is one caveat. If you're evaluating the behavior of rules in count mode (i.e., non-terminating rules), the template provided by AWS does not reference the "ruleMatchDetails" field within "nonTerminatingMatchingRules". This information is useful when investigating why a given rule flagged an XSS or SQLi condition.

As an example:

{"nonTerminatingMatchingRules":
        [{
            "ruleId":"RuleB-SQLi"
            ,"action":"COUNT"
            ,"ruleMatchDetails":
            [{
                "conditionType":"SQL_INJECTION"
                ,"location":"HEADER"
                ,"matchedData":[
                    "10"
                    ,"and"
                    ,"1"]
            }]
        }]
        ,"excludedRules":null
}

Create Table

CREATE EXTERNAL TABLE `waf_logs`(
  `timestamp` bigint COMMENT 'from deserializer', 
  `formatversion` int COMMENT 'from deserializer', 
  `webaclid` string COMMENT 'from deserializer', 
  `terminatingruleid` string COMMENT 'from deserializer', 
  `terminatingruletype` string COMMENT 'from deserializer', 
  `action` string COMMENT 'from deserializer', 
  `terminatingrulematchdetails` array<struct<conditiontype:string,location:string,matcheddata:array<string>>> COMMENT 'from deserializer', 
  `httpsourcename` string COMMENT 'from deserializer', 
  `httpsourceid` string COMMENT 'from deserializer', 
  `rulegrouplist` array<struct<rulegroupid:string,terminatingrule:struct<ruleid:string,action:string,rulematchdetails:string>,nonterminatingmatchingrules:array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>>>,excludedrules:string>> COMMENT 'from deserializer', 
  `ratebasedrulelist` array<struct<ratebasedruleid:string,limitkey:string,maxrateallowed:int>> COMMENT 'from deserializer', 
  `nonterminatingmatchingrules` array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>>> COMMENT 'from deserializer', 
  `requestheadersinserted` string COMMENT 'from deserializer', 
  `responsecodesent` string COMMENT 'from deserializer', 
  `httprequest` struct<clientip:string,country:string,headers:array<struct<name:string,value:string>>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string> COMMENT 'from deserializer', 
  `labels` array<struct<name:string>> COMMENT 'from deserializer')
PARTITIONED BY ( 
  `datestamp` string, 
  `webaclname` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='action,formatVersion,httpRequest,httpSourceId,httpSourceName,labels,nonTerminatingMatchingRules,rateBasedRuleList,requestHeadersInserted,responseCodeSent,ruleGroupList,terminatingRuleId,terminatingRuleMatchDetails,terminatingRuleType,timestamp,webaclId') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://waflogbucket/'
TBLPROPERTIES (
  'projection.datestamp.format'='yyyy/MM/dd/HH', 
  'projection.datestamp.interval'='1', 
  'projection.datestamp.interval.unit'='HOURS', 
  'projection.datestamp.range'='2021/01/01/00,NOW', 
  'projection.datestamp.type'='date', 
  'projection.enabled'='true', 
  'projection.webaclname.type'='injected', 
  'storage.location.template'='s3://waflogbucket/${webaclname}/${datestamp}')

In this specific example I'm partitioning on "webaclname" and "datestamp". In this scenario I'm aggregating logs from multiple WAF web ACLs, with a 1:1 relation between each web ACL and a top-level folder within the S3 bucket. "datestamp" refers to the hourly structure in which WAF logs are organized; the actual file path is a nested folder structure of 'yyyy/MM/dd/HH'.

Example Queries:

This example illustrates selecting the datestamp (effectively the hour), whether the request was missing an Origin header, the client IP, the browser User-Agent string, and the cookies sent with the request.

SELECT
  datestamp,
  element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'origin'), 1).value IS NULL AS MissingOrigin,
  httprequest.clientip,
  element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'user-agent'), 1).value AS UserAgent,
  element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'cookie'), 1).value AS Cookie
FROM "db"."waf_logs"
WHERE webaclname = 'production-web'
  AND datestamp >= '2021/01/01'
  AND httprequest.uri = '/uri/path'
  AND httprequest.httpmethod = 'POST'
ORDER BY 1 DESC

This example illustrates counting and sorting records on specific columns (User-Agent and URI path), while excluding certain HTTP methods and any requests that include an Origin request header.

SELECT
  count(*) AS Count,
  element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'user-agent'), 1).value AS useragent,
  httprequest.uri
FROM "db"."waf_logs"
WHERE webaclname = 'production-web'
  AND httprequest.httpmethod != 'GET'
  AND httprequest.httpmethod != 'HEAD'
  AND element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'origin'), 1).value IS NULL
  AND datestamp >= '2021/01/01'
GROUP BY 2, 3
ORDER BY 1 DESC
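
Circling back to the count-mode caveat above, a query along these lines (the database name and filter values are placeholders) unnests nonterminatingmatchingrules so the ruleMatchDetails explaining each SQLi or XSS match are visible:

SELECT
  datestamp,
  httprequest.clientip,
  httprequest.uri,
  rule.ruleid,
  rule.rulematchdetails
FROM "db"."waf_logs"
CROSS JOIN UNNEST(nonterminatingmatchingrules) AS t(rule)
WHERE webaclname = 'production-web'
  AND datestamp >= '2021/01/01'
  AND cardinality(rule.rulematchdetails) > 0
ORDER BY 1 DESC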