- 21 Sep, 2021
- read
AWS Athena
AWS Athena is a serverless interactive analytics offering by Amazon, which focuses on providing a means to query data stored in S3 buckets using SQL syntax.
Optimization
The AWS Athena billing model is primarily calculated from the volume of data scanned by each query. As such it it would behoove us to leverage a means to reduce data scanned; to which there exists a few options. In addition to realized cost savings, optimization techniques also serve to enhance query response performance.
Partitioning
By partitioning your data, you can decrease the amount of data scanned by each query, thus improving performance and reducing cost. AWS provides a decent explanation regarding partitioning data in Athena so I won’t reinvent the wheel.
However I will highlight the ability to perform, parition projection. With the use of partition projection, partition values and locations are calculated from configuration rather than read from a repository (like the AWS Glue Data Catalog.) For tables that are heavily parititoned this provides a performance increase as Athena avoid having to make a call to the AWS Glue Data Catalog. Additionally partition management is simplified as Athena is able to automatically build partitions itself from information provided within the partition projection configuration.
Compression
Secondly, AWS Athena supports reading several compressed data formats. This can be coupled with the compression capabilities inherent to AWS Kinesis Data Firehose.
Use Cases
Given the relatively cheap storage cost of S3 (especially when compared to venues such as EBS), S3 presents itself as an ideal solution for mid to long term log retention, or even scenarios where aggregated log volume may render traditional hot storage options as cost prohibitive.
WAF Logs
AWS provides example guidance on contstructing Athena tables for using with AWS WAF Logs. This can be sufficient for most use cases, however there exists one caveat. In the event you’re evaluating behavior of rules in count mode (ie: nonterminating rules) the template provided by AWS does not reference the “ruleMatchDetails” field within the “nonTerminatingMatchingRules”, this information is useful when discovering why a given rule may have triggered a XSS or SQLi condition.
As an Example:
{"nonTerminatingMatchingRules":
[{
"ruleId":"RuleB-SQLi"
,"action":"COUNT"
,"ruleMatchDetails":
[{
"conditionType":"SQL_INJECTION"
,"location":"HEADER"
,"matchedData":[
"10"
,"and"
,"1"]
}]
}]
,"excludedRules":null
}
Create Table
CREATE EXTERNAL TABLE `waf_logs`(
`timestamp` bigint COMMENT 'from deserializer',
`formatversion` int COMMENT 'from deserializer',
`webaclid` string COMMENT 'from deserializer',
`terminatingruleid` string COMMENT 'from deserializer',
`terminatingruletype` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`terminatingrulematchdetails` array<struct<conditiontype:string,location:string,matcheddata:array<string>>> COMMENT 'from deserializer',
`httpsourcename` string COMMENT 'from deserializer',
`httpsourceid` string COMMENT 'from deserializer',
`rulegrouplist` array<struct<rulegroupid:string,terminatingrule:struct<ruleid:string,action:string,rulematchdetails:string>,nonterminatingmatchingrules:array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>>>,excludedrules:string>> COMMENT 'from deserializer',
`ratebasedrulelist` array<struct<ratebasedruleid:string,limitkey:string,maxrateallowed:int>> COMMENT 'from deserializer',
`nonterminatingmatchingrules` array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>>> COMMENT 'from deserializer',
`requestheadersinserted` string COMMENT 'from deserializer',
`responsecodesent` string COMMENT 'from deserializer',
`httprequest` struct<clientip:string,country:string,headers:array<struct<name:string,value:string>>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string> COMMENT 'from deserializer',
`labels` array<struct<name:string>> COMMENT 'from deserializer')
PARTITIONED BY (
`datestamp` string,
`webaclname` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='action,formatVersion,httpRequest,httpSourceId,httpSourceName,labels,nonTerminatingMatchingRules,rateBasedRuleList,requestHeadersInserted,responseCodeSent,ruleGroupList,terminatingRuleId,terminatingRuleMatchDetails,terminatingRuleType,timestamp,webaclId')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://waflogbucket/'
TBLPROPERTIES (
'projection.datestamp.format'='yyyy/MM/dd/HH',
'projection.datestamp.interval'='1',
'projection.datestamp.interval.unit'='HOURS',
'projection.datestamp.range'='2021/01/01/00,NOW',
'projection.datestamp.type'='date',
'projection.enabled'='true',
'projection.webaclname.type'='injected',
'storage.location.template'='s3://waflogbucket/${webaclname}/${datestamp}')
In this specific example I’m partitioning upon “webaclname” and “datestamp. With this scenario I’m aggregating logs from multiple WAF Web ACLs, where there exists a 1:1 relation between web ACLs and a top level folder within the S3 bucket. Datestamp is in reference to the hourly structure in which WAF logs are organized by, in actuality the file path resembles folders arranged into a nested structure ‘yyyy/mm/dd/hh’.
Example Queries:
This example illustrates selecting datestamp (effectively the hour), whether the request was missing an origin header, browser user-agent string, and cookies sent with the request.
SELECT
datestamp,
element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'origin'), 1).value IS NULL AS MissingOrigin,
httprequest.clientip,
element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'user-agent'), 1).value AS UserAgent,
element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'cookie'), 1).value AS Cookie
from "db"."waf_logs"
where webaclname = 'production-web'
AND datestamp >= '2021/01/01'
AND httprequest.uri = '/uri/path'
AND httprequest.httpmethod = 'POST'
order by 1 desc
This examples illustrates counting and sorting records upon a specific column (User-Agent and URI path) additionally excluding certain http methods and the prescence of an origin request header.
SELECT
count() AS Count,
element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'user-agent'), 1).value AS useragent,
httprequest.uri
from "db"."waf_logs"
where webaclname = 'production-web'
AND httprequest.httpmethod != 'GET'
AND httprequest.httpmethod != 'HEAD'
AND element_at(filter(httprequest.headers, headers -> lower(headers.name) = 'origin'), 1).value IS NULL
AND datestamp >= '2021/01/01'
group by 2,3
ORDER BY 1 desc