Examination of data, rule generation and detection of phishing URLs using online logistic regression

Mohammed Nazim Feroz; Susan Mengel

doi:10.1109/BigData.2014.7004239

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

31 Scopus citations

Abstract

Web services such as online banking, gaming, and social networking have evolved rapidly, as has people's reliance on them for everyday tasks. As a result, a large amount of information is uploaded to the Web every day. The openness of the Web gives criminals opportunities to upload malicious content. Despite extensive research, email-based spam filtering techniques are unable to protect other web services. A countermeasure is therefore needed that generalizes across web services to protect users from phishing hosts. The paper describes an approach that automatically classifies URLs based on their lexical and host-based features. The usability of Mahout is demonstrated for such scalable machine learning problems, and online learning is considered over batch learning. The classifier achieves 93-97% accuracy by detecting a large number of phishing hosts while maintaining a modest false positive rate. The raw data is examined, and the effectiveness of various feature subsets is assessed. The relevance of bigrams is assessed and corroborated using the chi-squared and information gain attribute evaluation methods.
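To make the idea of online logistic regression over lexical URL features concrete, here is a minimal Python sketch. It is not the authors' implementation: the paper uses Mahout, and the abstract does not give the exact feature set. The hashed character bigrams, the bucket count, the learning rate, and the toy URLs below are all assumptions made for illustration, and host-based features are left out entirely.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 12  # size of the hashed feature space (arbitrary choice)


def featurize(url):
    """Map a URL to sparse {index: value} features.

    Assumed feature set: binary presence of hashed character bigrams,
    plus a bias term at index 0.
    """
    url = url.lower()
    feats = {0: 1.0}  # bias term
    for i in range(len(url) - 1):
        bigram = url[i:i + 2].encode("utf-8")
        # Indices 1..NUM_BUCKETS-1 are reserved for bigrams.
        idx = 1 + zlib.crc32(bigram) % (NUM_BUCKETS - 1)
        feats[idx] = 1.0
    return feats


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, dim, learning_rate=0.5):
        self.w = [0.0] * dim
        self.lr = learning_rate

    def predict_proba(self, feats):
        """Return the estimated probability that the URL is phishing."""
        z = sum(self.w[i] * v for i, v in feats.items())
        z = max(-35.0, min(35.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, label):
        """Take one gradient step on the log-loss for a single example."""
        error = label - self.predict_proba(feats)
        for i, v in feats.items():
            self.w[i] += self.lr * error * v


if __name__ == "__main__":
    # Hypothetical labelled stream: 1 = phishing, 0 = benign.
    stream = [
        ("http://secure-login.bank.example.test/verify/account", 1),
        ("http://www.university.example/courses", 0),
        ("http://paypa1-update.example.test/signin.php", 1),
        ("http://news.example/sports/today", 0),
    ]
    model = OnlineLogisticRegression(NUM_BUCKETS)
    for url, label in stream:
        model.update(featurize(url), label)  # the model never sees a full batch
    for url, _ in stream:
        print(url, round(model.predict_proba(featurize(url)), 3))
```

Each labelled URL updates the weights once and can then be discarded, which is what distinguishes the online setting from batch training on a stored data set.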

Original language	English
Title of host publication	Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Editors	Wo Chang, Jun Huan, Nick Cercone, Saumyadipta Pyne, Vasant Honavar, Jimmy Lin, Xiaohua Tony Hu, Charu Aggarwal, Bamshad Mobasher, Jian Pei, Raghunath Nambiar
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	241-250
Number of pages	10
ISBN (Electronic)	9781479956654
DOIs	https://doi.org/10.1109/BigData.2014.7004239
State	Published - 2014
Event	2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington, United States
Duration	Oct 27 2014 → Oct 30 2014

Publication series

Name	Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

Conference

Conference	2nd IEEE International Conference on Big Data, IEEE Big Data 2014
Country/Territory	United States
City	Washington
Period	10/27/14 → 10/30/14

Keywords

Attribute Evaluation
Decision Tree
Feature Vector
Rule Generation
Stochastic Gradient Descent

Cite this

Feroz, M. N., & Mengel, S. (2014). Examination of data, rule generation and detection of phishing URLs using online logistic regression. In W. Chang, J. Huan, N. Cercone, S. Pyne, V. Honavar, J. Lin, X. T. Hu, C. Aggarwal, B. Mobasher, J. Pei, & R. Nambiar (Eds.), Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014 (pp. 241-250). Article 7004239 (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2014.7004239

Feroz, Mohammed Nazim ; Mengel, Susan. / Examination of data, rule generation and detection of phishing URLs using online logistic regression. Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014. editor / Wo Chang ; Jun Huan ; Nick Cercone ; Saumyadipta Pyne ; Vasant Honavar ; Jimmy Lin ; Xiaohua Tony Hu ; Charu Aggarwal ; Bamshad Mobasher ; Jian Pei ; Raghunath Nambiar. Institute of Electrical and Electronics Engineers Inc., 2014. pp. 241-250 (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014).

@inproceedings{1db680a062ad4870977589c34d7ea4f7,
title = "Examination of data, rule generation and detection of phishing URLs using online logistic regression",
abstract = "Web services such as online banking, gaming, and social networking have rapidly evolved as has the reliance upon them by people to perform everyday tasks. As a result, a large amount of information is uploaded on a daily basis to the Web. The openness of the Web exposes opportunities for criminals to upload malicious content. Despite extensive research, email based spam filtering techniques are unable to protect other web services. Therefore, a counter measure must be taken that generalizes across web services to protect the user from phishing hosts. The paper describes an approach that classifies URLs automatically based on their lexical and host-based features. The usability of Mahout is demonstrated for such scalable machine learning problems, and online learning is considered over batch learning. The classifier achieves 93-97\% accuracy by detecting a large number of phishing hosts, while maintaining a modest false positive rate. The raw data is examined, and the effectiveness of various feature subsets is assessed. The relevance of bigrams is assessed, and strengthened by using the chi-squared and information gain attribute evaluation methods.",
keywords = "Attribute Evaluation, Decision Tree, Feature Vector, Rule Generation, Stochastic Gradient Descent",
author = "Feroz, {Mohammed Nazim} and Susan Mengel",
note = "Publisher Copyright: {\textcopyright} 2014 IEEE.; 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 ; Conference date: 27-10-2014 Through 30-10-2014",
year = "2014",
doi = "10.1109/BigData.2014.7004239",
language = "English",
series = "Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "241--250",
editor = "Wo Chang and Jun Huan and Nick Cercone and Saumyadipta Pyne and Vasant Honavar and Jimmy Lin and Hu, {Xiaohua Tony} and Charu Aggarwal and Bamshad Mobasher and Jian Pei and Raghunath Nambiar",
booktitle = "Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014",
}

Feroz, MN & Mengel, S 2014, Examination of data, rule generation and detection of phishing URLs using online logistic regression. in W Chang, J Huan, N Cercone, S Pyne, V Honavar, J Lin, XT Hu, C Aggarwal, B Mobasher, J Pei & R Nambiar (eds), Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014., 7004239, Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014, Institute of Electrical and Electronics Engineers Inc., pp. 241-250, 2nd IEEE International Conference on Big Data, IEEE Big Data 2014, Washington, United States, 10/27/14. https://doi.org/10.1109/BigData.2014.7004239

Examination of data, rule generation and detection of phishing URLs using online logistic regression. / Feroz, Mohammed Nazim; Mengel, Susan.
Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014. ed. / Wo Chang; Jun Huan; Nick Cercone; Saumyadipta Pyne; Vasant Honavar; Jimmy Lin; Xiaohua Tony Hu; Charu Aggarwal; Bamshad Mobasher; Jian Pei; Raghunath Nambiar. Institute of Electrical and Electronics Engineers Inc., 2014. p. 241-250 7004239 (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014).

TY - GEN

T1 - Examination of data, rule generation and detection of phishing URLs using online logistic regression

AU - Feroz, Mohammed Nazim

AU - Mengel, Susan

PY - 2014

Y1 - 2014

N2 - Web services such as online banking, gaming, and social networking have rapidly evolved as has the reliance upon them by people to perform everyday tasks. As a result, a large amount of information is uploaded on a daily basis to the Web. The openness of the Web exposes opportunities for criminals to upload malicious content. Despite extensive research, email based spam filtering techniques are unable to protect other web services. Therefore, a counter measure must be taken that generalizes across web services to protect the user from phishing hosts. The paper describes an approach that classifies URLs automatically based on their lexical and host-based features. The usability of Mahout is demonstrated for such scalable machine learning problems, and online learning is considered over batch learning. The classifier achieves 93-97% accuracy by detecting a large number of phishing hosts, while maintaining a modest false positive rate. The raw data is examined, and the effectiveness of various feature subsets is assessed. The relevance of bigrams is assessed, and strengthened by using the chi-squared and information gain attribute evaluation methods.

KW - Attribute Evaluation

KW - Decision Tree

KW - Feature Vector

KW - Rule Generation

KW - Stochastic Gradient Descent

UR - http://www.scopus.com/inward/record.url?scp=84985994461&partnerID=8YFLogxK

U2 - 10.1109/BigData.2014.7004239

DO - 10.1109/BigData.2014.7004239

M3 - Conference contribution

AN - SCOPUS:84985994461

T3 - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

SP - 241

EP - 250

BT - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

A2 - Chang, Wo

A2 - Huan, Jun

A2 - Cercone, Nick

A2 - Pyne, Saumyadipta

A2 - Honavar, Vasant

A2 - Lin, Jimmy

A2 - Hu, Xiaohua Tony

A2 - Aggarwal, Charu

A2 - Mobasher, Bamshad

A2 - Pei, Jian

A2 - Nambiar, Raghunath

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2nd IEEE International Conference on Big Data, IEEE Big Data 2014

Y2 - 27 October 2014 through 30 October 2014

ER -

Feroz MN, Mengel S. Examination of data, rule generation and detection of phishing URLs using online logistic regression. In Chang W, Huan J, Cercone N, Pyne S, Honavar V, Lin J, Hu XT, Aggarwal C, Mobasher B, Pei J, Nambiar R, editors, Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014. Institute of Electrical and Electronics Engineers Inc. 2014. p. 241-250. 7004239. (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014). doi: 10.1109/BigData.2014.7004239
