Introduction to Machine Learning and Malware Detection

Malware detection

Overview of executable files
Overview of detection methods
Resources and references

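Overview of executable files

ELF (Executable and Linkable Format)

The executable file format on Linux. Files in ELF format include executables, .o object files, and .so shared libraries; a .a static library is an ar archive whose members are ELF object files.

PE (Portable Executable)

The executable file format on Windows. Files in PE format include .exe and .dll; .obj object files use the closely related COFF format, and a .lib static library is an archive of COFF object files.

Reference [3] describes each field of the ELF format in detail.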

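Classification of executable-format files on Linux and Windows:

File type	Description	Examples
Relocatable file	Contains code and data and can be linked into an executable or a shared object file; object files and static libraries belong here	.o and .a on Linux, .obj and .lib on Windows
Shared object file	Contains code and data and has two main uses: it can be linked with object files or other shared object files into a new shared object file, or it can be combined with an executable and run as part of the process image	.so on Linux, .dll on Windows
Executable file	Contains a program that can be run directly	ELF executables on Linux (usually without an extension), .exe on Windows
Core dump file	When a process terminates unexpectedly, the system can dump the contents of its address space and other information at the time of termination into a core dump file	core dump on Linux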
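
The file types in the table can be told apart from the first few header bytes. The following is a minimal sketch, not a full parser: it assumes the standard ELF layout (magic `\x7fELF`, byte order flag at offset 5, `e_type` at offset 16) and the standard PE layout (`MZ` DOS header with the PE signature offset stored at 0x3C). The function name `identify_format` is made up for this illustration.

```python
import struct

# e_type values defined by the ELF specification
ELF_TYPES = {
    1: "relocatable file",
    2: "executable file",
    3: "shared object file",  # position-independent executables also use this value
    4: "core dump file",
}

def identify_format(data: bytes) -> str:
    """Return a coarse format label based on magic numbers only."""
    # ELF: 4-byte magic, then EI_CLASS (offset 4) and EI_DATA (offset 5)
    if len(data) >= 18 and data[:4] == b"\x7fELF":
        # EI_DATA: 1 = little endian, 2 = big endian
        endian = "<" if data[5] == 1 else ">"
        (e_type,) = struct.unpack_from(endian + "H", data, 16)
        return "ELF " + ELF_TYPES.get(e_type, "other")
    # PE: DOS header starts with "MZ"; the little-endian 32-bit value at 0x3C
    # is the file offset of the "PE\0\0" signature
    if len(data) >= 0x40 and data[:2] == b"MZ":
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "PE"
    return "unknown"
```

To label a file on disk, read its bytes with `open(path, "rb").read()` and pass them in. A .exe and a .dll both come back as "PE"; telling them apart requires reading the Characteristics field of the COFF header, which this sketch does not do.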

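Overview of detection methods

Detection methods fall broadly into two categories [4,5]:

Static detection: the program is not executed (for example, the disassembled code is analyzed directly)

Typical features: byte or entropy statistics over a sliding window, PE header and import address table (IAT) features, printable strings, PE metadata
Dynamic detection: the program is executed in a sandbox

Typical features: API call sequences, system call sequences, API graphs [6]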
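
Two of the static features named above, sliding-window entropy and printable strings, can be sketched with the standard library alone. This is an illustration under assumed parameters: the window size, step, and minimum string length are arbitrary choices, not values taken from the cited references.

```python
import math
import re
from collections import Counter
from typing import List

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in Counter(chunk).values())

def windowed_entropy(data: bytes, window: int = 1024, step: int = 256) -> List[float]:
    """Entropy of each sliding window; input shorter than one window yields one value."""
    last_start = max(len(data) - window, 0)
    return [shannon_entropy(data[i:i + window]) for i in range(0, last_start + 1, step)]

def printable_strings(data: bytes, min_len: int = 5) -> List[str]:
    """Runs of printable ASCII of at least min_len bytes, like the Unix `strings` tool."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [m.decode("ascii") for m in pattern.findall(data)]
```

Windows whose entropy approaches 8 bits per byte often indicate compressed or encrypted regions, which is why entropy statistics are a common input to static classifiers.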

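Static detection methods:

Opcode sequences or raw binary code, modeled with NLP methods
Binary code rendered as a grayscale image, classified with image methods
Byte histograms or byte-entropy histograms
Disassembly into assembly language, modeled with NLP methods
API call sequences recovered from the disassembly, modeled with NLP or graph methods
Printable strings or images embedded in the malware sample, used directly as features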
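
As a rough illustration of the byte-histogram and grayscale-image representations in the list above, the sketch below uses plain Python lists. The fixed row width of 256 and the zero padding of the last row are assumptions made for this example; published approaches choose the width in different ways.

```python
from typing import List

def byte_histogram(data: bytes) -> List[float]:
    """Normalized 256-bin histogram of byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]

def to_grayscale_rows(data: bytes, width: int = 256) -> List[List[int]]:
    """Reshape bytes into rows of `width` pixels (0-255), zero-padding the last row."""
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]
```

The histogram is a fixed-length vector that can be fed to any classifier, while the rows can be saved as an image or passed to a convolutional network after conversion to an array.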

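Dynamic detection methods:

Run the sample in a sandbox to collect runtime data (process details), then model the API call sequence with NLP or graph methods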
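
The two ways of modeling an API call sequence mentioned above can be sketched as follows: n-gram counts treat the trace like a sentence, and a weighted transition graph records which call follows which. The trace in the usage note is a hypothetical example, not sandbox output, and the function names are made up for this illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def ngram_counts(calls: List[str], n: int = 2) -> Counter:
    """Count contiguous n-grams of API names (a bag-of-n-grams feature)."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))

def call_graph(calls: List[str]) -> Dict[str, Dict[str, int]]:
    """Directed graph: edge weight = how often `dst` immediately follows `src`."""
    graph: Dict[str, Dict[str, int]] = defaultdict(dict)
    for src, dst in zip(calls, calls[1:]):
        graph[src][dst] = graph[src].get(dst, 0) + 1
    return dict(graph)

def vectorize(counts: Counter, vocabulary: List[Tuple[str, ...]]) -> List[int]:
    """Project n-gram counts onto a fixed vocabulary to get a feature vector."""
    return [counts.get(gram, 0) for gram in vocabulary]
```

For a hypothetical trace such as `["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]`, the bigram `("CreateFileW", "WriteFile")` is counted twice, and the graph has an edge from `CreateFileW` to `WriteFile` with weight 2. The count vectors can feed a conventional classifier, and the graph can feed a graph embedding method such as the one in [6].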

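Advantages and disadvantages:

Static detection does not run the sample, so it is fast and safe, but it is poorly suited to encrypted or obfuscated programs, whose real code is not visible in the file on disk.

Dynamic detection must execute the sample, which is less safe, but it can handle encrypted or obfuscated programs because it observes their behavior at run time.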

Public datasets:

EMBER[15]

SoReL-20M[16]

Resources and references

【1】A comparison of PE, ELF, and Mach-O. https://www.jianshu.com/p/21850560caf0

【2】Understanding object file formats: a.out, COFF, PE, ELF

【3】ELF file format study notes. https://www.cnblogs.com/sayhellowen/p/802b5b0ad648e1a343dcd0f85513065f.html

【4】A detailed look at machine-learning-based malicious code detection techniques. https://blog.csdn.net/Eastmount/article/details/120421043

【5】Survey of Machine Learning Techniques for Malware Analysis. 2019

【6】graph2vec: Learning Distributed Representations of Graphs

【7】Malware Images: Visualization and Automatic Classification. https://vision.ece.ucsb.edu/sites/default/files/publications/nataraj_vizsec_2011_paper.pdf

【8】A re-evaluation of online analysis tools. https://www.52pojie.cn/thread-871410-1-1.html

【9】Alibaba Cloud malicious program detection. https://tianchi.aliyun.com/competition/entrance/231694/information

【10】https://xz.aliyun.com/t/3106, https://xz.aliyun.com/t/3704

【11】Handling webshell attacks: A systematic mapping and survey. 2021

【12】Detecting unknown malicious code by applying classification techniques on OpCode patterns. https://security-informatics.springeropen.com/track/pdf/10.1186/2190-8532-1-1.pdf

【13】DataCon2020 malicious code analysis. https://zhuanlan.zhihu.com/p/187535672

【14】Malicious code classification. http://blog.moonsea.ac.cn/Malware-Classification. http://drops.xmd5.com/static/drops/tips-8151.html. http://blog.moonsea.ac.cn/Malcode-ngram-opcode

【15】EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models. https://arxiv.org/pdf/1804.04637.pdf

【16】SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection. https://arxiv.org/pdf/2012.07634.pdf

【17】Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. 2016

GitHub

https://github.com/malicialab/avclass

https://github.com/elastic/ember

https://github.com/sophos-ai/SoReL-20M

android malware

DL-Droid: Deep learning based android malware detection using real devices

DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket

API

MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics

MAAR: Robust features to detect malicious activity based on API calls, their arguments and return values

API GRAPH

Dynamic Android Malware Classification Using Graph-Based Representations

behaviour

Behavior-based anomaly detection on big data https://ro.ecu.edu.au/cgi/viewcontent.cgi?article=1182&context=ism

AMAL: High-fidelity, behavior-based automated malware analysis and classification https://www.sciencedirect.com/science/article/pii/S0167404815000425#bib37

DTB-IDS: an intrusion detection system based on decision tree using behavior analysis for preventing APT attacks https://link.springer.com/content/pdf/10.1007/s11227-015-1604-8.pdf

Transformer

BERT for malware classification. https://www.cnblogs.com/bitterz/p/14000826.html. https://github.com/bitterzzZZ/Bert-malware-classification

Malware Detection on Highly Imbalanced Data through Sequence Modeling

MalBERT: Using Transformers for Cybersecurity and Malicious Software Detection. https://arxiv.org/pdf/2103.03806.pdf

PalmTree: Learning an Assembly Language Model for Instruction Embedding. CCS 2021

GenTAL: Generative Denoising Skip-gram Transformer for Unsupervised Binary Code Similarity Detection. ICLR 2022

Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization