TY - GEN
T1 - Examination of data, rule generation and detection of phishing URLs using online logistic regression
AU - Feroz, Mohammed Nazim
AU - Mengel, Susan
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014
Y1 - 2014
AB - Web services such as online banking, gaming, and social networking have evolved rapidly, as has people's reliance on them for everyday tasks. As a result, a large amount of information is uploaded to the Web daily. The openness of the Web creates opportunities for criminals to upload malicious content. Despite extensive research, email-based spam filtering techniques are unable to protect other web services. Therefore, a countermeasure is needed that generalizes across web services to protect the user from phishing hosts. The paper describes an approach that classifies URLs automatically based on their lexical and host-based features. The usability of Mahout is demonstrated for such scalable machine learning problems, and online learning is considered over batch learning. The classifier achieves 93-97% accuracy by detecting a large number of phishing hosts while maintaining a modest false positive rate. The raw data is examined, and the effectiveness of various feature subsets is assessed. The relevance of bigrams is assessed and strengthened using the chi-squared and information gain attribute evaluation methods.
KW - Attribute Evaluation
KW - Decision Tree
KW - Feature Vector
KW - Rule Generation
KW - Stochastic Gradient Descent
UR - http://www.scopus.com/inward/record.url?scp=84985994461&partnerID=8YFLogxK
U2 - 10.1109/BigData.2014.7004239
DO - 10.1109/BigData.2014.7004239
M3 - Conference contribution
AN - SCOPUS:84985994461
T3 - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
SP - 241
EP - 250
BT - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
A2 - Chang, Wo
A2 - Huan, Jun
A2 - Cercone, Nick
A2 - Pyne, Saumyadipta
A2 - Honavar, Vasant
A2 - Lin, Jimmy
A2 - Hu, Xiaohua Tony
A2 - Aggarwal, Charu
A2 - Mobasher, Bamshad
A2 - Pei, Jian
A2 - Nambiar, Raghunath
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 October 2014 through 30 October 2014
ER -