Optimising Logstash Grok Patterns

Recently while writing some Grok patterns for the ELK cluster at work, I had a conversation with my colleage Berg about the efficiency of Grok patterns. I was working with Cisco ASA firewall logs, which are natural language, human readable logs, in contrast with IPtables or Palo Alto logs which are sent in a CSV format. In order to parse them into structured log events to feed into Elasticsearch, I used the Grok filter.

Here's a (particularly complex) example of a log entry, from the Cisco documentation:

%ASA-4-313005: No matching connection for ICMP error message: icmp_msg_info on interface_name interface. Original IP payload: embedded_frame_info icmp_msg_info = icmp src src_interface_name : src_address [([ idfw_user | FQDN_string ], sg_info)] dst dest_interface_name : dest_address [([ idfw_user | FQDN_string ], sg_info)] (type icmp_type, code icmp_code) embedded_frame_info = prot src source_address / source_port [([ idfw_user | FQDN_string ], sg_info)] dst dest_address / dest_port [( idfw_user | FQDN_string), sg_info ]

As you can see, there's a lot of data to sort through, and a lot of optional fields. My Grok patterns ended up being pretty complex. But that's for another blog post. My main concern was that I was nearing 50 separate grok patterns to handle all the different events that were being produced by the firewalls. (There are obviously a lot more in the documentation, but a lot of them are feature-specific). And due to the way the logs are presented to logstash, we basically have to run every log event against every pattern. The config looks something like this:

if [type] == "cisco-asa-fw" {
    filter {
        grok {
            match => {
                "message" =>
                    "%{CISCOFW104001}",
                    "%{CISCOFW104002}",
                    "%{CISCOFW104003}",
                    "%{CISCOFW104004}",
                    "%{CISCOFW105003}",
                    ...
            }
        }
    }
}

Processing events through all of these patterns (which are, to the best of my knowledge, glorified regexes) should take a fairly large amount CPU time, right?

Taking a quick look at the Grok filter documentation shows that by default, the plugin will attempt to match each event against every configured pattern, breaking when it finds a successful match. So depending on the order of patterns in the config, each event might have up to 50 patterns tested against it, or it could match to the first or second.

After seeing this, I started to consider how best to order the patterns for maximum efficiency. I should put the most popular patterns at the top, right? But would it make a difference...?

The first step was to determine which patterns were the most popular. I put together the following perl snippet and ran it over a few day's worth of ASA logs that had been processed to keep just the log message ID. The total size of the log was roughly 80 million lines. I was using good old sort | uniq -c at first, but realised that sorting so many lines would take forever.

perl -ne '
     $count{$_}++;
     END {
         print "$count{$_} $_" for sort {
             $count{$b} <=> $count{$a} || $b cmp $a
         } keys %count
     }
' big.log

and the result:

73504640 %ASA-4-106023
2536204 %ASA-6-302014
1588292 %ASA-6-302016
1460910 %ASA-6-302013
1243479 %ASA-6-302015
1187433 %ASA-6-302021
1187431 %ASA-6-302020
924327 %ASA-4-313005
842554 %ASA-4-419002
114544 %ASA-3-710003
64095 %ASA-6-106015
15153 %ASA-5-746015
15151 %ASA-5-746014
14324 %ASA-4-313004
...

over 90% of our events were for the same event ID (incidentally, it's a log event for a packet getting blocked by an ACL. no wonder we had so many...). I checked the config and the pattern was near the top, since the patterns were listed in ascending order. However, it should probably be moved to the first position in the list.

I set up some quick benchmarking, a simple logstash config to process log events without any output, and used the metrics plugin to determine the processing rate. Then I fed a week's worth of logs into the logstash instance, and recorded the processing rate after it stabilised, which took ~3 minutes. I ran the test 3 times.

rate: 16625.322893111657 total: 4940453

rate: 15958.912346442401 total: 4738993

rate: 16286.595263190859 total: 4873216

(16625 + 15958 + 16286) / 3 = 16289 events per second.

After reordering the grok patterns in the config to match their frequency in our logs, I re-ran the test:

rate: 17719.69440467573 total: 5319454

rate: 18118.21670125316 total: 5292698

rate: 17571.606762173426 total: 5261024

(17719 + 18118 + 17571) / 3 = 17802 events per second.

((17802 - 16289) / 16289) * 100 = a 9.2% increase in processing rate.

So, not an amazing increase, but i'll take free performance increases where I can.

Thanks again to Berg Lloyd-Haig for pushing me to write this post, and getting me to fix up a bunch of the logs that my team feeds to his ELK instance.

Comments