Predicting vulnerable software components through N-gram analysis and statistical feature selection

Yulei Pang; Xiaozhen Xue; Akbar Siami Namin

doi:10.1109/ICMLA.2015.99

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

62 Scopus citations

Abstract

Vulnerabilities need to be detected and removed from software. Although previous studies have demonstrated the usefulness of prediction techniques for identifying vulnerable software components, improving the accuracy and effectiveness of these techniques remains a grand research challenge. This paper proposes a hybrid technique that combines N-gram analysis with feature selection algorithms to predict vulnerable software components, where features are defined as contiguous sequences of tokens in source code files, i.e., Java class files. Machine learning-based feature selection algorithms are then employed to reduce the feature and search space. We evaluated the proposed technique on several Java Android applications, and the results demonstrated that it could predict vulnerable classes, i.e., software components, with high precision, accuracy, and recall.
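
The pipeline described in the abstract (token N-grams as features, followed by statistical feature selection) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the tokenizer regex, the function names (`ngrams`, `rank_sum_z`, `select_features`), and the use of a Wilcoxon rank-sum z-score to rank features are all assumptions. The last is suggested only by the "Wilcoxon test" keyword, since the record does not state how the test is applied.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifiers, integer literals, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def ngrams(source, n):
    """Count contiguous n-token sequences in one source string."""
    tokens = TOKEN_RE.findall(source)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rank_sum_z(xs, ys):
    """Wilcoxon rank-sum z-score for xs versus ys.

    Uses average ranks for ties and the normal approximation without a
    tie correction in the variance. Both samples must be non-empty.
    """
    pooled = sorted((v, g) for g, vals in enumerate((xs, ys)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(xs), len(ys)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


def select_features(vulnerable, neutral, n=2, k=5):
    """Return the k n-grams whose per-file counts differ most between groups."""
    vuln_counts = [ngrams(src, n) for src in vulnerable]
    neut_counts = [ngrams(src, n) for src in neutral]
    vocab = set().union(*vuln_counts, *neut_counts)
    scored = []
    for gram in vocab:
        z = rank_sum_z([c[gram] for c in vuln_counts],
                       [c[gram] for c in neut_counts])
        scored.append((abs(z), gram))
    # Highest |z| first; ties broken by the n-gram itself for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [gram for _, gram in scored[:k]]
```

In a setup like this, the selected N-grams would become the columns of a per-class feature matrix passed to a classifier; the choice of classifier and of N is left open here because the record does not specify them.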

Original language	English
Title of host publication	Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	543-548
Number of pages	6
ISBN (Electronic)	9781509002870
DOIs	https://doi.org/10.1109/ICMLA.2015.99
State	Published - Mar 2 2016
Event	IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015 - Miami, United States Duration: Dec 9 2015 → Dec 11 2015

Publication series

Name	Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015

Conference

Conference	IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015
Country/Territory	United States
City	Miami
Period	12/9/15 → 12/11/15

Keywords

Feature selection
N-gram
Vulnerability prediction
Wilcoxon test

Access to Document

10.1109/ICMLA.2015.99

Cite this

Pang, Y., Xue, X., & Namin, A. S. (2016). Predicting vulnerable software components through N-gram analysis and statistical feature selection. In Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015 (pp. 543-548). Article 7424372 (Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICMLA.2015.99

Pang, Yulei ; Xue, Xiaozhen ; Namin, Akbar Siami. / Predicting vulnerable software components through N-gram analysis and statistical feature selection. Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015. Institute of Electrical and Electronics Engineers Inc., 2016. pp. 543-548 (Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015).

@inproceedings{17dc72ee38f242738487c547e1159220,

title = "Predicting vulnerable software components through N-gram analysis and statistical feature selection",

abstract = "Vulnerabilities need to be detected and removed from software. Although previous studies demonstrated the usefulness of employing prediction techniques in deciding about vulnerabilities of software components, the accuracy and improvement of effectiveness of these prediction techniques is still a grand challenging research question. This paper proposes a hybrid technique based on combining N-gram analysis and feature selection algorithms for predicting vulnerable software components where features are defined as continuous sequences of token in source code files, i.e., Java class file. Machine learning-based feature selection algorithms are then employed to reduce the feature and search space. We evaluated the proposed technique based on some Java Android applications, and the results demonstrated that the proposed technique could predict vulnerable classes, i.e., software components, with high precision, accuracy and recall.",

keywords = "Feature selection, N-gram, Vulnerability prediction, Wilcoxon test",

author = "Yulei Pang and Xiaozhen Xue and Namin, {Akbar Siami}",

note = "Publisher Copyright: {\textcopyright} 2015 IEEE.; IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015 ; Conference date: 09-12-2015 Through 11-12-2015",

year = "2016",

month = mar,

day = "2",

doi = "10.1109/ICMLA.2015.99",

language = "English",

series = "Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "543--548",

booktitle = "Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015",

}

Pang, Y, Xue, X & Namin, AS 2016, Predicting vulnerable software components through N-gram analysis and statistical feature selection. in Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015., 7424372, Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015, Institute of Electrical and Electronics Engineers Inc., pp. 543-548, IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015, Miami, United States, 12/9/15. https://doi.org/10.1109/ICMLA.2015.99

Predicting vulnerable software components through N-gram analysis and statistical feature selection. / Pang, Yulei; Xue, Xiaozhen; Namin, Akbar Siami.
Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015. Institute of Electrical and Electronics Engineers Inc., 2016. p. 543-548 7424372 (Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015).

TY - GEN

T1 - Predicting vulnerable software components through N-gram analysis and statistical feature selection

AU - Pang, Yulei

AU - Xue, Xiaozhen

AU - Namin, Akbar Siami

PY - 2016/3/2

Y1 - 2016/3/2

N2 - Vulnerabilities need to be detected and removed from software. Although previous studies demonstrated the usefulness of employing prediction techniques in deciding about vulnerabilities of software components, the accuracy and improvement of effectiveness of these prediction techniques is still a grand challenging research question. This paper proposes a hybrid technique based on combining N-gram analysis and feature selection algorithms for predicting vulnerable software components where features are defined as continuous sequences of token in source code files, i.e., Java class file. Machine learning-based feature selection algorithms are then employed to reduce the feature and search space. We evaluated the proposed technique based on some Java Android applications, and the results demonstrated that the proposed technique could predict vulnerable classes, i.e., software components, with high precision, accuracy and recall.

AB - Vulnerabilities need to be detected and removed from software. Although previous studies demonstrated the usefulness of employing prediction techniques in deciding about vulnerabilities of software components, the accuracy and improvement of effectiveness of these prediction techniques is still a grand challenging research question. This paper proposes a hybrid technique based on combining N-gram analysis and feature selection algorithms for predicting vulnerable software components where features are defined as continuous sequences of token in source code files, i.e., Java class file. Machine learning-based feature selection algorithms are then employed to reduce the feature and search space. We evaluated the proposed technique based on some Java Android applications, and the results demonstrated that the proposed technique could predict vulnerable classes, i.e., software components, with high precision, accuracy and recall.

KW - Feature selection

KW - N-gram

KW - Vulnerability prediction

KW - Wilcoxon test

UR - http://www.scopus.com/inward/record.url?scp=84969673989&partnerID=8YFLogxK

U2 - 10.1109/ICMLA.2015.99

DO - 10.1109/ICMLA.2015.99

M3 - Conference contribution

AN - SCOPUS:84969673989

T3 - Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015

SP - 543

EP - 548

BT - Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015

Y2 - 9 December 2015 through 11 December 2015

ER -

Pang Y, Xue X, Namin AS. Predicting vulnerable software components through N-gram analysis and statistical feature selection. In Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015. Institute of Electrical and Electronics Engineers Inc. 2016. p. 543-548. 7424372. (Proceedings - 2015 IEEE 14th International Conference on Machine Learning and Applications, ICMLA 2015). doi: 10.1109/ICMLA.2015.99
