Sophos and ReversingLabs announced the release of SoReL-20M, a database containing 20 million Windows Portable Executable files, including 10 million malware samples.
The SoReL-20M database includes a set of curated and labeled samples, along with security-relevant metadata, that can serve as a training dataset for the machine learning engines used in anti-malware solutions.
The scarcity of large, well-formed training sets is a major obstacle to building effective machine learning models.
“The Sophos AI team is excited to announce the release of SOREL-20M (Sophos-ReversingLabs – 20 million) – a production-scale dataset containing metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples available for download for the purpose of research on feature extraction to drive industry-wide improvements in security.” reads the post published by Sophos.
SOREL-20M is the first production-scale malware research dataset to be made publicly available, and it was released with the intent of accelerating research on malware detection via machine learning.
The experts pointed out that large numbers of curated and labeled samples are very expensive and difficult to obtain. Most work on malware detection is based on private, internal datasets that cannot be shared, so the results cannot be directly compared with each other.
“Unlike image recognition or natural language processing, the area of security has seen much less activity and a relatively slower rate of improvement. A major reason for this is simply the lack of a standard, large-scale, realistic data set that can be easily obtained and tested by a wide range of users, from independent researchers to academic labs to large corporate groups.” continues Sophos.
The dataset includes, for each malware sample, features extracted based on the EMBER 2.0 dataset, labels, detection metadata, and complete binaries.
The experts also released a set of pre-trained PyTorch (https://pytorch.org/) and LightGBM (https://github.com/Microsoft/LightGBM) models trained on this dataset. Sophos also released scripts to load and iterate over the data, as well as to load, train, and test the models.
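To give a rough idea of what "loading and iterating over the data" can look like in practice, here is a minimal Python sketch. It is not the code released by Sophos: the `samples` table, its `sha256` and `is_malware` columns, and the helper names are hypothetical and chosen only for illustration, and a tiny in-memory SQLite database stands in for the real metadata.

```python
import sqlite3
from typing import Iterator, List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for a sample metadata store.

    The schema is hypothetical and is NOT the real SoReL-20M layout.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (sha256 TEXT PRIMARY KEY, is_malware INTEGER)"
    )
    # Ten fake records: even indices are benign (0), odd indices are malware (1).
    rows = [(f"{i:064x}", i % 2) for i in range(10)]
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    conn.commit()
    return conn


def iter_batches(
    conn: sqlite3.Connection, batch_size: int
) -> Iterator[List[Tuple[str, int]]]:
    """Yield (sha256, label) rows in fixed-size batches without loading all rows."""
    cur = conn.execute("SELECT sha256, is_malware FROM samples ORDER BY sha256")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def count_labels(conn: sqlite3.Connection, batch_size: int = 4) -> Tuple[int, int]:
    """Return (benign_count, malware_count) by streaming over the batches."""
    benign = malware = 0
    for batch in iter_batches(conn, batch_size):
        for _sha256, label in batch:
            if label:
                malware += 1
            else:
                benign += 1
    return benign, malware


if __name__ == "__main__":
    demo = build_demo_db()
    print(count_labels(demo))
```

Streaming the records in batches, rather than reading everything at once, is the kind of design choice that matters at the scale of 20 million samples, since the full metadata and feature set may not fit in memory.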
The public availability of training sets like SoReL-20M could also benefit sophisticated attackers, who could use them to create new threats, but Sophos pointed out that well-resourced attackers likely already have access to easy-to-use and cost-effective malware datasets.
For this reason, it is essential to give security researchers this dataset and help them build a new generation of tools that could be effective for malware detection, thanks to the metadata released alongside the samples.
“That said, while the introduction of machine learning technologies represents a significant leap forward for threat detection at scale, these systems are only as good as the datasets they have access to.” states the announcement published by ReversingLabs.
“All this data gives our customers a well defined dataset of threat intelligence to leverage in their defenses, and as part of their threat hunting programs, to both block active attacks and search for threats that may otherwise be invisible to the traditional security stack.”
(SecurityAffairs – hacking, SoReL-20M)