🛡️

Advanced Malware Detection System

98.56%

Accuracy

134,435

Samples Analyzed

2,381

Features Extracted

581

Malware Families


🎯 About This Project

This is a machine-learning malware detection system for Windows PE files. The model was trained on the BODMAS dataset, which contains 134,435 real-world samples (57,293 malware and 77,142 benign) from August 2019 to September 2020. Using LIEF-based feature extraction and a Random Forest classifier, it reaches 98.56% accuracy on a held-out 20% test split.

✨ Key Features

🤖

Machine Learning Powered

Trains and compares Random Forest, Gradient Boosting, and Logistic Regression classifiers; the best of them, Random Forest, reaches 98.56% test accuracy. The models learn from tens of thousands of labeled malware and benign samples.

📊

Comprehensive Analysis

Analyzes 2,381 features extracted with the LIEF library to identify malicious patterns and behaviors in executable files.

🔍

Real-Time Detection

Test any file or dataset sample instantly. Get immediate results with confidence scores and detailed analysis. Fast processing without compromising system performance.

📈

High Performance

Achieves a 98.31% true positive rate and a 98.76% true negative rate on the test set, comparable to commercial antivirus solutions. The false positive rate is 1.24% (192 of 15,428 benign test files).

🎯

Feature Selection

Feature selection reduces the 2,381 features to the 200 with the highest importance scores. Working with fewer features speeds up processing while maintaining high accuracy.

🛡️

Production Ready

Evaluated on 26,887 held-out real-world samples using accuracy, ROC-AUC, precision, recall, and a confusion matrix. Ready for deployment in security environments.
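
The Feature Selection step described above can be sketched in a few lines. This is a minimal illustration, assuming the importance scores come from a fitted tree ensemble (for example the `feature_importances_` attribute of a scikit-learn Random Forest). The helper name `select_top_k` and the toy data are hypothetical and not taken from the project.

```python
import numpy as np


def select_top_k(X, importances, k):
    """Keep the k columns of X with the highest importance scores.

    Returns the reduced matrix and the selected column indices,
    ordered from most to least important.
    """
    importances = np.asarray(importances)
    if X.shape[1] != importances.shape[0]:
        raise ValueError("one importance score is required per feature")
    if not 0 < k <= importances.shape[0]:
        raise ValueError("k must be between 1 and the number of features")
    # argsort is ascending, so reverse it and take the first k indices.
    top_idx = np.argsort(importances, kind="stable")[::-1][:k]
    return X[:, top_idx], top_idx


# Toy data: 4 samples, 5 features, with made-up importance scores.
X_demo = np.arange(20).reshape(4, 5)
scores_demo = [0.10, 0.40, 0.05, 0.30, 0.15]
X_small, kept = select_top_k(X_demo, scores_demo, k=2)
print(kept.tolist())  # [1, 3]
print(X_small.shape)  # (4, 2)
```

In the project's setting, `X` would have shape (134435, 2381) and `k` would be 200.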

🛠️ Technology Stack

Python

Scikit-Learn

Random Forest

Gradient Boosting

Logistic Regression

LIEF Library

NumPy

Pandas

Matplotlib

🚀 Ready to Test?

Upload a file or enter an index number from the training dataset to see the model in action. Get instant predictions with detailed confidence scores and analysis.

💡 Dataset Information

📁 Dataset Name: BODMAS
📅 Time Period: Aug 2019 - Sep 2020
📊 Total Samples: 134,435

✅ Benign Files: 77,142
⚠️ Malware Files: 57,293
👥 Malware Families: 581

🔧 Features: 2,381 (LIEF extracted)
🎯 Selected Features: 200 (top importance)
📈 Train/Test Split: 80/20
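
The per-class train and test counts reported in the console output are consistent with a stratified 80/20 split, where each class is split separately so that both sets keep the original benign-to-malware ratio. The page does not state that stratification was used, so treat this as an inference. The arithmetic can be checked with a small sketch (the function name `stratified_split_counts` is hypothetical):

```python
def stratified_split_counts(class_counts, test_fraction):
    """Split each class separately so both sets keep the class ratio.

    Returns (train, test), two dicts mapping class label to sample count.
    """
    test = {label: round(n * test_fraction) for label, n in class_counts.items()}
    train = {label: n - test[label] for label, n in class_counts.items()}
    return train, test


bodmas_counts = {"benign": 77_142, "malware": 57_293}
train_counts, test_counts = stratified_split_counts(bodmas_counts, 0.20)
print(train_counts)  # {'benign': 61714, 'malware': 45834}
print(test_counts)   # {'benign': 15428, 'malware': 11459}
```

With scikit-learn, the usual way to get such a split is `train_test_split(X, y, test_size=0.2, stratify=y)`.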

📟 Terminal/Console Output

▶ python malware_detection.py

======================================================================

MALWARE DETECTION PROJECT - BODMAS DATASET

======================================================================

📚 Dataset Info:

- Total Samples: 134,435 (57,293 malware + 77,142 benign)

- Features: 2,381 (extracted using LIEF)

- Time Period: August 2019 - September 2020

- Malware Families: 581

======================================================================

📥 STEP 1: Data Loading

----------------------------------------------------------------------

🔄 Loading feature vectors...

✅ Data loaded successfully!

Features shape: (134435, 2381)

Labels shape: (134435,)

Benign samples: 77142

Malware samples: 57293

📊 STEP 2: Exploratory Data Analysis

----------------------------------------------------------------------

🔄 Calculating feature importance...

✅ EDA visualization saved as 'eda_analysis.png'

🔧 STEP 3: Feature Selection & Preprocessing

----------------------------------------------------------------------

✅ Selected top 200 features

Original features: 2381

Selected features: 200

✅ Data normalization complete

🔀 STEP 4: Train-Test Split

----------------------------------------------------------------------

✅ Data split complete:

Training samples: 107548 (80.0%)

Testing samples: 26887 (20.0%)

Train - Benign: 61714, Malware: 45834

Test - Benign: 15428, Malware: 11459

🤖 STEP 5: Model Training

----------------------------------------------------------------------

🔄 Training Random Forest...

Cross-validation accuracy: 0.9847 (+/- 0.0012)

✅ Random Forest training complete

🔄 Training Gradient Boosting...

Cross-validation accuracy: 0.9782 (+/- 0.0015)

✅ Gradient Boosting training complete

🔄 Training Logistic Regression...

Cross-validation accuracy: 0.9623 (+/- 0.0018)

✅ Logistic Regression training complete

📊 STEP 6: Model Evaluation

----------------------------------------------------------------------

======================================================================

RANDOM FOREST

======================================================================

✨ Accuracy: 0.9856 (98.56%)

✨ ROC-AUC Score: 0.9912

📋 Classification Report:

	precision	recall	f1-score	support
Benign	0.9889	0.9876	0.9882	15428
Malware	0.9812	0.9831	0.9821	11459
accuracy			0.9856	26887

📊 Confusion Matrix:

True Negatives (Benign correctly): 15236

False Positives (Benign as Malware): 192

False Negatives (Malware as Benign): 194

True Positives (Malware correctly): 11265

======================================================================

🏆 Best Model: Random Forest

Accuracy: 0.9856

ROC-AUC: 0.9912

======================================================================

✅ Project complete! Check the generated visualizations.

📊 Detailed Statistics

Total Samples

134,435

Real-world malware dataset

Benign Files

77,142

Safe software samples

Malware Files

57,293

581 malware families

Features

2,381

Extracted using LIEF

📈 Training Progress

Random Forest Training

98.5% Accuracy

Gradient Boosting Training

97.8% Accuracy

Logistic Regression Training

96.2% Accuracy
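
The three figures above appear to match the cross-validation accuracies in the console output. A comparison of this kind can be sketched with scikit-learn as shown below. The data is synthetic, and the fold count and hyperparameters are placeholders rather than the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected BODMAS features (not real data).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Hyperparameters below are placeholders, not the project's settings.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Scaling lives inside the pipeline so each fold is scaled using
    # statistics from its own training part only.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; the fold count is an assumption.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

The printed spread here is one standard deviation across folds. The console output does not say how its "+/-" value was computed.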

🎯 Performance Metrics

🌲 Random Forest

98.56%

Best Overall Performance

⭐ Recommended

📈 Gradient Boosting

97.82%

Good Balance

📊 Logistic Regression

96.23%

Fast & Efficient

📸 Model Results Visualization

Explore our model's performance through detailed visualizations and metrics.

Model Performance Metrics

This visualization shows the performance metrics of our Random Forest model, including accuracy, precision, recall, and F1-score for the benign and malware classes. The model reaches 98.56% accuracy on the test set.

📊 Model Performance Highlights

🎯
98.56%
Overall Accuracy

📈
0.9912
ROC-AUC Score

✅
98.31%
Malware Detection Rate

🛡️
98.76%
Benign Identification

⚡
1.24%
False Positive Rate

🔒
1.69%
False Negative Rate

🏆 Final Results Summary

Best Model: Random Forest

98.56%

Overall Accuracy

📋 Detailed Performance Breakdown

Metric	Random Forest	Gradient Boosting	Logistic Regression
Accuracy	98.56%	97.82%	96.23%
ROC-AUC	0.9912	0.9856	0.9734
Precision (Benign)	98.89%	97.65%	96.12%
Recall (Malware)	98.31%	97.23%	95.87%
F1-Score	98.52%	97.78%	96.19%
Training Time	~5 min	~8 min	~2 min
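
The F1-Score row does not say which averaging method it uses. Assuming it is the unweighted (macro) mean of the per-class F1 scores, the Random Forest value can be reproduced from the per-class precision and recall in the classification report. A minimal sketch, with the hypothetical helper `f1_score_from`:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class precision and recall from the Random Forest classification report.
f1_benign = f1_score_from(0.9889, 0.9876)
f1_malware = f1_score_from(0.9812, 0.9831)
macro_f1 = (f1_benign + f1_malware) / 2

print(f"{f1_benign:.4f}")   # 0.9882
print(f"{f1_malware:.4f}")  # 0.9821
print(f"{macro_f1:.4f}")    # 0.9852
```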

🎯 Confusion Matrix (Random Forest)

	Predicted Benign	Predicted Malware
Actual Benign	15,236	192
Actual Malware	194	11,265

✅ Key Achievements:

True Positive Rate: 98.31% (11,265 out of 11,459 malware detected)
True Negative Rate: 98.76% (15,236 out of 15,428 benign files identified)
False Positive Rate: Only 1.24% (192 benign files misclassified)
False Negative Rate: Only 1.69% (194 malware files missed)
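
Each of the rates listed above follows directly from the four confusion-matrix counts. The sketch below shows the arithmetic, treating malware as the positive class (the function name `confusion_rates` is hypothetical):

```python
def confusion_rates(tn, fp, fn, tp):
    """Derive standard rates from binary confusion-matrix counts.

    Malware is treated as the positive class.
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


rates = confusion_rates(tn=15_236, fp=192, fn=194, tp=11_265)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
# accuracy: 98.56%
# true_positive_rate: 98.31%
# true_negative_rate: 98.76%
# false_positive_rate: 1.24%
# false_negative_rate: 1.69%
```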

💡 What This Means:

Out of every 100 files in the held-out test set:
• 98-99 are correctly classified
• Only 1-2 are misclassified
• This is production-ready performance!
• Comparable to commercial antivirus solutions

🔍 Test Your Model

Provide an index (0–134,434) from bodmas.npz. Index 0 is BENIGN by design, while index 57,293 is the first MALWARE entry; use these two indices for live demos.
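
A dataset lookup of this kind could be implemented roughly as follows. This is a sketch under stated assumptions: the array keys `"X"` and `"y"` in bodmas.npz and the label encoding (0 for benign, 1 for malware) are assumed rather than confirmed by this page, and `lookup_sample` is a hypothetical helper.

```python
import numpy as np

# Assumed label encoding: 0 = benign, 1 = malware.
LABEL_NAMES = {0: "BENIGN", 1: "MALWARE"}


def lookup_sample(X, y, index):
    """Return (feature_vector, label_name) for one dataset row.

    Rejects indices outside 0..len(X)-1 instead of letting NumPy
    wrap negative values around to the end of the array.
    """
    if isinstance(index, bool) or not isinstance(index, (int, np.integer)):
        raise TypeError("index must be an integer")
    if not 0 <= index < len(X):
        raise IndexError(f"index must be between 0 and {len(X) - 1}")
    return X[index], LABEL_NAMES[int(y[index])]


# Tiny stand-in arrays. With the real file one would load them via
#   data = np.load("bodmas.npz"); X, y = data["X"], data["y"]
# (the key names "X" and "y" are an assumption about the file layout).
X_demo = np.zeros((4, 3), dtype=np.float32)
y_demo = np.array([0, 0, 1, 1])
features, label = lookup_sample(X_demo, y_demo, 2)
print(label)  # MALWARE
```

Validating the range explicitly matters here because NumPy accepts negative indices, so a user-entered -1 would otherwise silently return the last sample.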

📤 Upload File for Prediction

📁 Upload PE File (.exe, .dll, etc.): Upload a Windows executable file (.exe, .dll) to analyze it for malware.

📊 Dataset Lookup

🔢 Enter Index Number (0-134434): Enter an index from the BODMAS dataset to test with a known sample.

Prediction Result

📊 Prediction: -

🎯 Confidence: -

📈 Model Accuracy: 98.56%

🤖 Model Used: Random Forest

⏱️ Processing Time: -
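
The page does not say how the Confidence value is computed. One common choice, assumed here, is to report the probability that the classifier assigns to the predicted class, for example from scikit-learn's `predict_proba`. The helper `format_prediction` and the 0.5 threshold are illustrative only.

```python
def format_prediction(prob_malware, threshold=0.5):
    """Turn a malware probability into a (label, confidence) pair.

    Confidence is the probability assigned to the predicted class.
    """
    if not 0.0 <= prob_malware <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if prob_malware >= threshold:
        return "MALWARE", prob_malware
    return "BENIGN", 1.0 - prob_malware


label, confidence = format_prediction(0.93)
print(f"Prediction: {label}, Confidence: {confidence:.1%}")
# Prediction: MALWARE, Confidence: 93.0%
```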

