How does CRC work in Splunk?
In this post we cover one of Splunk’s vital under-the-hood operations, the cyclic redundancy check (CRC), which Splunk performs before ingesting data.
Splunk can determine whether a file it is monitoring (such as /var/log/messages) has been rolled by the operating system (to /var/log/messages1), and it will not read the rolled file a second time.
The Splunk monitoring processor picks up new files and reads the first 256 bytes of the file by default. After that, the processor hashes this data into a begin and end cyclic redundancy check (CRC), which functions as a fingerprint (unique identity) representing the file content.
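The fingerprinting step above can be sketched in a few lines of Python. This is only an illustration: zlib.crc32 stands in for whatever checksum Splunk uses internally, and the names begin_crc and INIT_CRC_LENGTH are made up for this example.

```python
import zlib

INIT_CRC_LENGTH = 256  # default number of leading bytes, as described above


def begin_crc(content: bytes, length: int = INIT_CRC_LENGTH) -> int:
    """Fingerprint only the first `length` bytes of the content."""
    # zlib.crc32 is a stand-in; the checksum Splunk uses internally may differ.
    return zlib.crc32(content[:length])


header = b"H" * INIT_CRC_LENGTH
original = header + b"event A\n"
grown = original + b"event B\n"

# Appending data leaves the leading bytes, and so the fingerprint, unchanged.
print(begin_crc(original) == begin_crc(grown))
```

For a file on disk, the same idea would apply to the bytes returned by reading the first 256 bytes in binary mode.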
Splunk maintains a database of the beginning CRCs of every file it has seen and looks up each newly computed CRC in it. If the lookup succeeds, it returns several values. The important ones are the seekAddress, the number of bytes into the known file that Splunk has already read, and the seekCRC, a fingerprint of the data at that location.
The results of this lookup help Splunk categorize the file.
There are three possible outcomes of a CRC lookup:
1) No record in the database matches the CRC from the beginning of the file. This indicates a file that Splunk hasn’t seen before. Splunk ingests the file from the start and updates the database with the new CRCs and seek addresses as it goes.
2) A record matches the CRC from the beginning of the file, the content at the seek address matches the stored CRC for that location, and the file is larger than the stored seek address. This indicates that Splunk has seen the file before, but data has been added since it was last read. Splunk opens the file, seeks to the seek address (the end of the file when Splunk last read it), and starts ingesting the new content from that point.
3) A record matches the CRC from the beginning of the file, but the content at the seek address does not match the stored CRC for that location. Splunk has read some file with the same initial data, but either some of the material it read has been modified in place, or this is in fact an entirely different file that begins with the same content. Because the content-tracking database is keyed to the beginning CRC, it cannot track progress independently for the two data streams, and further configuration is required.
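The three outcomes can be sketched as a small lookup routine. This is a simplified model, not Splunk's implementation: zlib.crc32 is a stand-in checksum, SEEK_WINDOW (how many bytes before the seek address feed the seekCRC) is an assumption, and Record, classify and remember are hypothetical names. The extra "unchanged" result (file size equal to the seek address) is not one of the three outcomes above and is included only so the function always returns something sensible.

```python
import zlib
from dataclasses import dataclass

INIT_CRC_LENGTH = 256  # leading bytes used for the beginning CRC
SEEK_WINDOW = 256      # assumption: bytes before seekAddress used for seekCRC


@dataclass
class Record:
    seek_address: int  # bytes of the file already read
    seek_crc: int      # fingerprint of the data just before seek_address


def seek_crc(content: bytes, seek_address: int) -> int:
    start = max(0, seek_address - SEEK_WINDOW)
    return zlib.crc32(content[start:seek_address])


def classify(content: bytes, db: dict) -> str:
    """Return 'new', 'appended', 'unchanged' or 'conflict' for a file."""
    record = db.get(zlib.crc32(content[:INIT_CRC_LENGTH]))
    if record is None:
        return "new"        # outcome 1: never seen, read from the start
    same_tail = (len(content) >= record.seek_address
                 and seek_crc(content, record.seek_address) == record.seek_crc)
    if not same_tail:
        return "conflict"   # outcome 3: same beginning, different content
    if len(content) > record.seek_address:
        return "appended"   # outcome 2: resume reading at seek_address
    return "unchanged"      # nothing new to read


def remember(content: bytes, db: dict) -> None:
    """Record that the whole file has been read."""
    key = zlib.crc32(content[:INIT_CRC_LENGTH])
    db[key] = Record(len(content), seek_crc(content, len(content)))


db = {}
header = b"H" * INIT_CRC_LENGTH
first = header + b"event A\n"

print(classify(first, db))                  # before Splunk has read it
remember(first, db)
print(classify(first + b"event B\n", db))   # same file, grown
print(classify(header + b"event B\n", db))  # different file, same header
```

The last call shows why identical headers are a problem: the lookup key is the beginning CRC alone, so a second file with the same first 256 bytes lands on the first file's record.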
NOTE: Because the CRC check covers only the first 256 bytes of the file by default, different files can end up with the same beginning CRC, especially when they share an identical header.
You can handle such circumstances as follows:
i) Apply the initCrcLength attribute in inputs.conf to increase the number of bytes used for the CRC calculation, and make it longer than your static header.
ii) Apply the crcSalt attribute when configuring the file in inputs.conf. The crcSalt attribute, when set to <SOURCE>, ensures that each file has a unique CRC. The effect of this attribute/value pair is that Splunk assumes that each source/pathname contains unique content.
CAUTION: Never use crcSalt = <SOURCE> with rolling log files, or when log files get renamed or moved to another monitored location. Doing so prevents Splunk from recognizing log files across the roll or rename, which results in the data being reindexed.
Hope you liked this post!
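To see why both options help, and why the caution matters, here is a toy model of the fingerprint. It makes several assumptions: zlib.crc32 stands in for Splunk's checksum, the salt is mixed in by simply prepending the source path to the bytes (how Splunk actually combines the two is not stated here), and the function name and file paths are hypothetical.

```python
import zlib


def begin_crc(content: bytes, init_crc_length: int = 256, salt: str = "") -> int:
    # Assumption: the salt is prepended to the leading bytes before hashing.
    # This models the idea only; Splunk's real mixing method may differ.
    return zlib.crc32(salt.encode() + content[:init_crc_length])


static_header = b"#" * 300  # header longer than the 256-byte default
file_a = static_header + b"event A\n"
file_b = static_header + b"event B\n"

# Default length: only header bytes are hashed, so the two files collide.
print(begin_crc(file_a) == begin_crc(file_b))

# Option i: a length beyond the header reaches the differing content.
print(begin_crc(file_a, 512) != begin_crc(file_b, 512))

# Option ii: salting with the source path separates the two files.
print(begin_crc(file_a, salt="/var/log/app/a.log")
      != begin_crc(file_b, salt="/var/log/app/b.log"))

# The caution: the same content under a new name gets a new fingerprint,
# so a rolled or renamed file looks like one that was never seen.
print(begin_crc(file_a, salt="/var/log/app.0")
      != begin_crc(file_a, salt="/var/log/app.1"))
```

In this model all four comparisons hold. The last one is the reindexing problem from the caution: once the path is part of the fingerprint, a rename alone is enough to make the file look new.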
Thanks. We’ve been working on some crcsalt applications and we were wondering (curious if anyone knows)… Docs say that crcSalt value (string) is “added to CRC”. I/we are guessing that the crcsalt string is simply appended to the first 256 bytes of the file (or whatever initcrclength is set to) before the CRC computation is applied. Or is it something more complex than that? (I.e., does some form of key stretching or other modification to either string occur before the “adding”?) Also, does anyone know if crcsalt is “added” by appending to the end of the first X bytes of the file — or by pre-pending to the first X bytes?
Thanks for sending your question through this blog. Please find our answer below:
As we understand it, the Splunk input processor treats the content of any file configured with the crcSalt attribute as unique, so it doesn’t look the file up in the CRC database and thus avoids any kind of CRC conflict.
A blog excerpt:
“The crcSalt attribute, when set to <SOURCE>, ensures that each file has a unique CRC. The effect of this attribute/value pair is that Splunk assumes that each source/pathname contains unique content.”
Thank you for the detailed explanation!