How Spectra Assure analyzes software

All Spectra Assure products are powered by ReversingLabs Spectra Core - the engine that analyzes every software package you scan with the CLI, upload to the Portal, or push to the Spectra Assure Docker images and CI/CD integrations.

The process of analyzing software involves several steps, and its final output is a set of analysis reports. To better understand the source and significance of the information contained in those reports, it's helpful to learn what Spectra Core does in the background of Spectra Assure products.

This page provides an overview of the Spectra Assure analysis process and explains what happens with files in each of the analysis steps.

The following main steps have dedicated sections where they are described in detail:

  1. Identification
  2. Unpacking
  3. Validation
  4. Metadata processing
  5. Classification
  6. Policy controls

Automated static analysis

When you scan a software package with Spectra Assure, the engine performs automated static analysis of every file contained in the software package. Automated static analysis is also referred to as complex binary analysis. This unique approach to software analysis decomposes files, collects their metadata, and classifies them in terms of the security risk they pose to end-users. Files are analyzed recursively, which means that every file extracted from the software package goes through the same analysis process as its container software package.

As implemented in Spectra Assure, automated static analysis does not require access to the source code, unlike SAST tools, which typically do. It can directly examine compiled software binaries to determine their structure, dependencies, and behaviors. In addition to analyzing software binaries (which is the primary use-case), Spectra Assure can analyze library code and source code for specific scripting languages. More details on this are available on the community and language coverage page.

Another benefit of automated static analysis is that files are not executed during the analysis process. All available data is extracted even if the files are compressed, executable, or damaged - regardless of their target OS or platform. Because the analysis process does not execute any files, it can be completed in milliseconds and performed on very large files without significant performance penalties.

All these features of automated static analysis give Spectra Assure a unique advantage - it can analyze post-build artifacts and detect more novel, sophisticated software supply chain attacks than SCA tools are able to. SCA tools typically analyze package managers, manifest files, or source code repositories to find vulnerabilities. They are limited by the need for known signatures of open source dependencies that have to be cross-referenced against a vulnerability database. Being used in pre-build environments, SCA tools lack visibility into deep file structures and build process tampering evidence - insights that Spectra Assure readily provides.

The Spectra Assure analysis process

The process starts with the input file - the software package that is being analyzed. The analysis engine performs several distinct steps on every object it extracts from the input file.

The following steps make up the flow that every object goes through:

  • Get input file - File for analysis is acquired from a Spectra Assure product or integration.
  • Identify file format - Engine determines the file format using signatures or machine learning models.
  • Unpack components - Engine extracts all file contents with a dedicated file format unpacker.
  • Validate structure - Engine performs certificate and file integrity checks for the file format.
  • Process metadata - Extracted and collected metadata is analyzed and transformed into file behavior descriptions.
  • Classify file - Multiple ReversingLabs technologies are used to produce a classification verdict for the file.
  • Apply policies - Built-in and user-defined Spectra Assure policies check for software quality and security issues.

1. Identification

Format identification is the initial step of the Spectra Assure analysis process. To successfully perform the subsequent analysis steps, we first need to know the file format of every object we are analyzing.

Specifically, this step analyzes the object structure to determine whether it's binary or text, and assigns the analyzed object a unique file format description. This description - file format identification - instructs the analysis engine on which rules and modules to use for further file processing.

Two main approaches are used for format identification:

  • Signatures - created by ReversingLabs researchers to identify binary file formats based on their unique features. For example, Windows .exe files start with the bytes "MZ", while PNG files start with the byte 0x89 followed by "PNG". Signatures describe expectations of what a file format should contain. Using heuristics, the analysis process checks whether those expectations align with the actual file structure. In addition to signatures, the analysis process also evaluates any relevant YARA rules (both those built into the engine and user-provided ones). If there are multiple matches, those from signatures take priority over YARA rule matches.

  • Machine learning models - created and trained by ReversingLabs researchers to identify textual file formats based on statistical text identification. The models are able to recognize basic text objects as scripting languages and distinguish software source code from other types of textual content.
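
The signature approach can be illustrated with a minimal Python sketch. It is not the Spectra Core implementation: the SIGNATURES table, the identify_format function, and every format description except Binary/Archive/ZIP are hypothetical names chosen for illustration, and the text fallback is a crude UTF-8 check rather than a machine learning model.

```python
# Hypothetical signature table: (leading magic bytes, format description).
SIGNATURES = [
    (b"MZ", "Binary/PE"),                        # Windows executables
    (b"\x89PNG\r\n\x1a\n", "Binary/Image/PNG"),  # PNG images
    (b"PK\x03\x04", "Binary/Archive/ZIP"),       # ZIP local file header
]


def identify_format(data: bytes) -> str:
    """Return a format description based on the leading bytes of an object."""
    for magic, description in SIGNATURES:
        if data.startswith(magic):
            return description
    # No signature matched: fall back to a crude binary/text split.
    # (Assumption: a real engine would use trained models here.)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "Binary/Unknown"
    return "Text/Unknown"
```

Checking signatures before the text fallback mirrors the priority described above, where signature matches win over other identification methods.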

✅ Completing the identification step

The results of the format identification step are:

  • File hashes - calculated by the analysis engine
  • File format descriptions - represented as File type/File subtype/Identification (for example, Binary/Archive/ZIP). If there are multiple versions of a file format, they can be identified through the additional version field.
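
As a rough sketch of what these two results might look like as data, the following Python snippet computes hashes and builds a format description that follows the Binary/Archive/ZIP example. The choice of SHA-1 and SHA-256, the FormatDescription class, the file_hashes function, and the way the version field is rendered are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FormatDescription:
    file_type: str
    subtype: str
    identification: str
    version: Optional[str] = None  # only set when a format has several versions

    def __str__(self) -> str:
        base = f"{self.file_type}/{self.subtype}/{self.identification}"
        # Assumption: the version is shown as a suffix; the real layout may differ.
        return f"{base} (version {self.version})" if self.version else base


def file_hashes(data: bytes) -> Dict[str, str]:
    """Compute hex digests for the object's content."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```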

After the format has been identified, the file is either directed to the proper unpacking module according to its signature, or to the validation step.

2. Unpacking

Unpacking, also referred to as file decomposition, is a step in the Spectra Assure analysis process where the analyzed file is taken apart to extract all available components and metadata.

During the unpacking process, the analysis engine eliminates obfuscation, encryption, compression, and any other protections that may have been applied to the file and its contents. The engine has built-in mechanisms to prevent infinite recursion, and supports configuring the decompression ratio and unpacking depth (how many layers of a file to extract).
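
The two limits mentioned above, unpacking depth and decompression ratio, can be sketched for plain ZIP containers using Python's standard zipfile module. This is only an illustration of the idea: MAX_DEPTH, MAX_RATIO, and the unpack function are hypothetical, and the real engine handles far more formats and protections.

```python
import io
import zipfile

MAX_DEPTH = 3    # hypothetical limit: how many nested layers to extract
MAX_RATIO = 100  # hypothetical limit: uncompressed/compressed size per entry


def unpack(data: bytes, name: str = "input", depth: int = 0) -> list:
    """Return (depth, name) for every object reachable from a ZIP container."""
    found = [(depth, name)]
    # Stop when the depth limit is reached or the object is not a ZIP archive.
    if depth >= MAX_DEPTH or not zipfile.is_zipfile(io.BytesIO(data)):
        return found
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            ratio = info.file_size / max(info.compress_size, 1)
            if ratio > MAX_RATIO:
                # Skip likely decompression bombs instead of extracting them.
                continue
            # Every extracted object goes through the same process recursively.
            found.extend(unpack(archive.read(info), info.filename, depth + 1))
    return found
```

Capping depth guarantees that the recursion terminates even for archives that contain themselves, while the ratio check bounds memory use per entry.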

Different file formats require different unpacking approaches because of their structure and complexity. Because static analysis does not execute a file, it requires unpackers - specialized tools for parsing and unpacking individual file formats. ReversingLabs develops in-house static unpackers tailored to specific file formats, and Spectra Assure relies on those unpackers during analysis.

Generally speaking, goodware file formats are easier to unpack because their structure is known and well-defined, and file behavior can be observed from the format definition.

File formats commonly used for malware are good at hiding code, which makes their unpacking more challenging. To create an unpacker for malware file formats, researchers have to identify each format and document its structure. The unpacker must be able to simulate file execution so that its code can be reconstructed and its behavior observed. Any obfuscation and protection artifacts must also be removed to allow extracting further objects. Information about the file behavior allows the unpacker - and consequently, the analysis process - to reveal the original software intent and to let users understand the true meaning of the code that was packed in that particular file format.

The ability to unpack a file format makes it possible for the Spectra Assure analysis engine to extract a wealth of metadata and critical information often not available from other tools. The collected metadata includes but is not limited to: format header details, strings (including secrets and URIs), function names, library dependencies, and file segments.

Unpacking greatly increases the surface that can be analyzed and helps file classification by providing more metadata to look at. This makes it easier to confirm classification verdicts and increases the chance to catch every threat.

✅ Completing the unpacking step

After the file has been successfully unpacked, all collected metadata and the unpacked file content are passed to the validator assigned to the file format. The validator then performs integrity checks on the available data.

3. Validation

Validation is a step in the Spectra Assure analysis process where the structure and the digital signatures of the analyzed file are verified according to specific criteria for each file format.

In the validation step, the previously identified file format is checked against its specification (the formal definition of the file format by its designer). In other words, the validation process looks for differences between the file format specification and its implementation. By doing this, we can gather additional information about the file format and detect anomalies in it.

Any malformations that violate the file format specification are further examined to determine if they are capable of triggering potentially malicious behavior. Such malformations may be reported as known vulnerabilities. ReversingLabs uses these malformation patterns to create heuristics for potential future exploits and predictive vulnerability detection.

Multiple validators may be used to verify a file format. They are called in order, from first to last, stopping as soon as one of them acknowledges that it recognizes and can handle the specific file format. If validation fails for one of them, the entire file is marked as invalid. Detected issues are reported as validation warnings or errors, depending on their severity.
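
One way to picture this validator chain is the following Python sketch. The ValidationResult class, png_validator, and run_validators are hypothetical names, the single PNG check (presence of the mandatory IEND chunk) is a deliberately simplistic stand-in for real specification checks, and the warning emitted when no validator recognizes the format is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ValidationResult:
    valid: bool = True
    warnings: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


# A validator returns None when it does not recognize the format.
Validator = Callable[[bytes], Optional[ValidationResult]]


def png_validator(data: bytes) -> Optional[ValidationResult]:
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):
        return None  # not a PNG: let the next validator try
    result = ValidationResult()
    if b"IEND" not in data:
        # The PNG specification requires a closing IEND chunk.
        result.valid = False
        result.errors.append("missing IEND chunk")
    return result


def run_validators(data: bytes, validators: List[Validator]) -> ValidationResult:
    """Call validators in order and stop at the first one that handles the data."""
    for validator in validators:
        result = validator(data)
        if result is not None:
            return result
    # Assumption: an unrecognized format is reported as a warning, not an error.
    return ValidationResult(warnings=["no validator recognized the format"])
```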

In addition to performing integrity checks of the file format structure, the validation step also verifies any digital certificates that have been used for code signing. Depending on its status, a certificate may influence the classification of files signed with it. The validation step assigns one of the following statuses to every detected certificate:

  • Valid certificate
  • Invalid certificate
  • Bad checksum
  • Bad signature
  • Malformed certificate
  • Self-signed certificate
  • Impersonation attempt
  • Expired certificate
  • Untrusted certificate
  • Revoked certificate

✅ Completing the validation step

After the file has been validated, all collected metadata is processed, evaluated, and transformed into actionable information that can be used to deliver the final file classification.

4. Metadata processing

Metadata processing is a step in the Spectra Assure analysis process where all previously collected metadata is translated into human-readable, explainable information. That information is used to produce or support the final file classification. Most of it is surfaced in Spectra Assure analysis reports (primarily in the SAFE report).

In this step, metadata is converted into capabilities and indicators. They build on the file format properties and platform-specific features of the analyzed file to describe software behavior and intent in more detail. The goal is to make it clearer what the analyzed code means and what each object is trying to do.

Indicators

Indicators can be described as behavior markers that are triggered when a specific pattern is found in the collected metadata or in the file content. An indicator may be triggered for multiple reasons. While some indicators can only be found in specific file formats, most are universal and therefore generally applicable.

Indicators contribute to the final file classification, but not in equal measure. Those deemed highly relevant are better at describing the detected malware type, while those with less relevant contributions help in solidifying the machine learning detection.

Capabilities

Based on the indicators triggered on a file, the analysis engine infers that the file exhibits a specific behavior, or that it is capable of performing specific actions. Similar software behaviors are grouped into broader categories - capabilities - according to the features they have in common.

For example, a file can have the filesystem capability, which is a broad description that says the file can access the filesystem or perform filesystem operations, but doesn't describe which operation will actually take place. More fine-grained software behavior descriptions are derived from the indicators (e.g. "Accesses the httpd.conf file").
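
The relationship between fine-grained indicators and broad capabilities can be sketched as a simple grouping. The INDICATOR_CAPABILITY table and the capabilities_for function below are hypothetical, and apart from "Accesses the httpd.conf file" and the filesystem capability, the indicator and capability names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical catalogue: indicator description -> capability category.
INDICATOR_CAPABILITY = {
    "Accesses the httpd.conf file": "filesystem",
    "Deletes files": "filesystem",
    "Opens a network socket": "network",
}


def capabilities_for(indicators: Iterable[str]) -> Dict[str, List[str]]:
    """Group triggered indicators under the capability they belong to."""
    grouped = defaultdict(list)
    for indicator in indicators:
        capability = INDICATOR_CAPABILITY.get(indicator)
        if capability is not None:  # unknown indicators are ignored in this sketch
            grouped[capability].append(indicator)
    return dict(grouped)
```

The keys of the returned dictionary are the broad capabilities, and the values keep the detailed behaviors that justify each one.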

Tags

The metadata processing step also assigns tags to files based on their properties such as certificate information, software behaviors, file contents, and many more. Some tags can only be applied to specific file types (for example, web browsers or mobile applications).

Tags are visible in the SAFE report for all unpacked files and for URIs in the Networking section of the report, where they can be used for filtering. The Spectra Assure CLI policy controls allow refining the processing filters by targeting specific tags.

✅ Completing the metadata processing step

After the metadata has been fully processed, the file receives its classification status in the next step of the analysis.

5. Classification

Classification is a step in the Spectra Assure analysis process where the analysis engine produces a verdict on whether the analyzed file contains threats harmful to the end-user.

Multiple technologies are used for file classification:

  • format identification
  • signatures (byte pattern matches)
  • file structure validation
  • extracted file hierarchy
  • file similarity (RHA1)
  • certificates
  • machine learning
  • heuristics (for scripts and fileless malware)
  • YARA rules included in the analysis engine

They are shipped with the analysis engine and can be used offline, without connecting to any external sources. Their coverage varies based on threat and file format type. In other words, not all technologies can detect all threat types, and not all of them work on all file formats.

Those default classification abilities of the Spectra Assure platform can be extended with threat intelligence from the ReversingLabs Cloud to retrieve file reputation information, and with custom YARA rules for user-assisted classification.

Some classification approaches are more specific than others, with signatures being the most specific. The final classification result relies on the information from all analysis steps, and it is a combination of all technologies applicable to the file format. The final result always matches the verdict of at least one technology, even when the technologies disagree with each other. Because of differences in how malicious files and malware families behave, some files might end up classified as malicious by one technology, and still be considered goodware by others. This doesn't negate or diminish the final classification.

Explainable Machine Learning

Spectra Assure is the first and only solution on the market that relies on Explainable Machine Learning (xAI) for threat detection. Explainable Machine Learning was launched by ReversingLabs in 2020 as a predictive threat detection method that can detect novel malware. It focuses on providing threat analysts with human-readable insights into machine learning-driven classifications.

The goal of ReversingLabs Explainable Machine Learning is to go beyond the basic verdict of "goodware vs malware", and to help analysts understand what type of threat was found, why it was detected, and what to do with it next.

To achieve that, the classification system combines:

  • explainability (by surfacing software behaviors in the form of indicators),
  • relevance (by ranking behaviors based on their contribution to the final verdict),
  • and transparency (by displaying why each software behavior was triggered).

Using natural language to provide clear explanations for classification decisions helps security analysts understand how analyzed software behaves and what malware is capable of doing to the system. This transparency fosters trust, facilitates informed decision-making, and makes the logic behind machine learning classification verdicts easier to follow.

Over the years, ReversingLabs threat analysts and researchers have carefully transformed raw code and metadata produced by static analysis into indicators - descriptions of software intent.

Those indicators are used in training machine learning (ML) models to recognize if a file is malicious based on the described software functionality and behavior. Many of the threats in the training datasets are hand-picked by ReversingLabs experts and fully, correctly labeled so that ML models can learn what constitutes a specific threat type, and distinguish it from other threat types as well as from clean software.

This allows ML models to proactively detect and describe threats - even brand new malware - without the need for additional training. When Spectra Assure scans a file and extracts some indicators from it, ML models can match them against the indicators they have learned to recognize as typical for malware or a specific threat type.

Some indicators are more meaningful in the context of a malware or threat type, so they contribute more to the classification. When the model decides that something is malicious, the decision can be verified through indicators and reasons why they were triggered. This makes the decision more transparent, relevant, and explainable in terms that are familiar to human analysts.
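
To make the idea of ranked contributions concrete, here is a toy Python sketch using a linear score passed through a logistic function. It is not how ReversingLabs models work: the WEIGHTS table, the BIAS value, the indicator names, and the explain function are all invented for illustration.

```python
import math
from typing import Iterable, List, Tuple

# Hypothetical per-indicator weights; a real model would learn these from data.
WEIGHTS = {
    "Encrypts files in user directories": 2.5,
    "Deletes shadow copies": 2.0,
    "Reads the system clock": 0.1,
}
BIAS = -3.0  # hypothetical offset so that few or weak indicators score low


def explain(indicators: Iterable[str]) -> Tuple[float, List[Tuple[str, float]]]:
    """Return a malware probability and the indicators ranked by contribution."""
    contributions = {name: WEIGHTS.get(name, 0.0) for name in indicators}
    score = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return probability, ranked
```

Because each indicator's contribution is kept alongside the verdict, an analyst can see which behaviors drove the decision rather than receiving only a score.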

ReversingLabs ML models are tailored to threat types to increase accuracy and continuously improved to boost their resilience.

All classification models can detect if a file is malicious or not. The PE (Portable Executable) malware classifier is also able to provide the information on the detected threat type. The exact threat type indicates higher confidence in the classification result, while threats that get assigned a generic threat type ("Malware") may point to new, emerging malware.

The following ML models are used for malware classification:

  • PE malware classifier - detects if a file is malicious (that covers all the threat types) and if it is a specific malware type (one of Backdoor, Downloader, Infostealer, Keylogger, PUA, Ransomware, Worm)

  • Script classifiers - apply to Text/<script type> files and only classify files as malware without specifying the threat type:

      ◦ Python malware classifier
      ◦ AutoIt malware classifier
      ◦ Excel4 malware classifier
      ◦ PowerShell malware classifier
      ◦ VBA malware classifier

When a malware detection is made by Explainable Machine Learning, it raises a dedicated policy violation (SQ30108).

Classification propagation

When classifying files, the analysis engine also considers the context: where a file is located in the file hierarchy and what relationships it has to other files in their shared container (software package). This is reflected in classification propagation - a mechanism to classify files based on the files they contain or the files that contain them. Classification can propagate through the file hierarchy in two ways:

1 - Child to parent - for example, a ZIP file contains a malicious EXE, so the ZIP will also be classified as malicious.

2 - Parent to child - for example, Microsoft's Malicious Software Removal Tool contains files that look malicious, but since Microsoft is a trusted publisher, those children are overridden to goodware in the context of that file thanks to a feature called goodware overrides.
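
Both propagation directions can be sketched on a small file tree in Python. The Node class, its trusted_publisher flag, and the propagate function are hypothetical simplifications; in particular, how the engine decides that a parent qualifies for goodware overrides is not covered here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    status: str = "Unknown"          # the file's own verdict before propagation
    trusted_publisher: bool = False  # hypothetical trigger for goodware overrides
    children: List["Node"] = field(default_factory=list)


def override_children(node: Node) -> None:
    """Parent to child: mark the whole subtree of a trusted parent as Goodware."""
    for child in node.children:
        child.status = "Goodware"
        override_children(child)


def propagate(node: Node) -> str:
    """Apply both propagation directions and return the node's final status."""
    if node.trusted_publisher:
        node.status = "Goodware"
        override_children(node)
        return node.status
    # Child to parent: a malicious child makes its container malicious.
    child_statuses = [propagate(child) for child in node.children]
    if "Malicious" in child_statuses:
        node.status = "Malicious"
    return node.status
```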

✅ Completing the classification step

File classification produced in this step consists of the final classification status and the threat name (if detected).

The status can be one of the following: Malicious or Suspicious (threats found); Goodware (clean and trusted); Unknown (no known threats).

The threat name contains the following: the platform targeted by the threat; the threat type; the threat family within the threat type. Threat names conform to the ReversingLabs Malware Naming Standard.
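
The three parts of a threat name can be separated with a small parser. The sketch below assumes a dot-separated Platform.Type.Family layout and uses an illustrative sample name; consult the ReversingLabs Malware Naming Standard for the authoritative format. ThreatName and parse_threat_name are hypothetical helpers.

```python
from typing import NamedTuple


class ThreatName(NamedTuple):
    platform: str     # the platform targeted by the threat
    threat_type: str  # the threat type
    family: str       # the threat family within the threat type


def parse_threat_name(name: str) -> ThreatName:
    """Split a threat name, assuming the layout Platform.Type.Family."""
    parts = name.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected threat name layout: {name!r}")
    return ThreatName(*parts)
```

For example, under this assumption parse_threat_name("Win32.Ransomware.WannaCry") yields the platform Win32, the type Ransomware, and the family WannaCry.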

6. Policy controls

In this step, the engine looks at file content and metadata properties extracted from the analyzed file, and applies policy controls to identify and report software quality issues, anomalous content, and security concerns.

In the context of Spectra Assure, a policy is a set of built-in rules that prescribe how software should behave in order to be considered secure. Policies are created by ReversingLabs experts to surface different types of issues, and organized into policy categories accordingly. Some policies are applicable only to specific platforms, file types or file formats, which means they do not produce relevant output if the analyzed file doesn't match their use-case.

When Spectra Assure detects a file in a software package that violates a policy (breaks its built-in rules), that gets reported as an issue. Some issues are considered more severe than others, and can impact the final build status of the analyzed file.

Policy controls are settings that tell the analysis engine which policies to use during the scan, and specify conditions that should be applied when those policies are triggered. Policy controls are organized into profiles, which can be set up for entire software projects or for specific, individual files. Users can override the default controls to enable or disable specific policies, or suppress detected policy violations on the level of a software package or an individual file contained in the package.

SAFE Levels and SAFE Assessment are Spectra Assure concepts closely related to policies and policy controls. The former - SAFE Levels - are predefined sets of policy controls that let users gradually improve their software supply chain security by making specific policy violations block progress to the next level.

SAFE Assessment is a summary overview of key risks and safety concerns detected in the analyzed software. Those risks are grouped into categories, and specific policies are mapped to each category. When a policy is violated, an issue is reported in its risk category, and the overall risk may increase according to the severity and priority of that issue.

✅ Completing the policy controls step

The results of this step are included in the analysis reports to show which policies have been violated and which have not been used during analysis.

The analyzed file receives the final CI status (PASS or FAIL) based on the policy controls that have been applied. That CI status indicates how the analyzed file should be treated in the context of the user's CI/CD pipeline and the software project in general. In other words, it tells the user if the analyzed file is safe to release and use, or if they should block the release process because of detected issues.
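
How policy controls might turn a list of violations into a CI status can be sketched as follows. This is a simplified model, not the product's logic: the Violation class, its blocking flag, the ci_status function, and the "EXAMPLE-1" policy ID are hypothetical, while SQ30108 is the policy ID mentioned earlier on this page.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Violation:
    policy_id: str
    blocking: bool  # hypothetical: whether this issue is severe enough to fail the build


def ci_status(
    violations: Iterable[Violation],
    disabled: FrozenSet[str] = frozenset(),
    suppressed: FrozenSet[str] = frozenset(),
) -> str:
    """Return "FAIL" if any active, unsuppressed violation is blocking."""
    for violation in violations:
        if violation.policy_id in disabled or violation.policy_id in suppressed:
            continue  # policy controls removed this violation from consideration
        if violation.blocking:
            return "FAIL"
    return "PASS"
```

In a CI/CD pipeline, the returned status would typically decide whether the release job continues or stops.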

Every policy violation (issue) included in the analysis report comes with a set of recommended steps for mitigating the issue.