New Cyber Security study aims to identify malicious websites through their design features
Web mining applies data mining techniques to discover and automatically learn from information on the World Wide Web. One of its most challenging applications is classifying a website into the category to which it belongs, and in particular identifying malicious websites. Malicious websites are created with the intent of harming users, stealing information, or carrying out other undesirable activities, both while the user is visiting the site and afterwards.
In recent years, many methods have been developed for identifying website categories. Some are based on analyzing the website's textual content, its users' navigation profiles, suspect traits found in the website itself, IP forgeries and many other properties. At the same time, the already enormous amount of information on the Internet continues to grow rapidly, making the identification task ever more complex and often requiring a great deal of expensive resources.
A study carried out by Doron Cohen, under the supervision of Prof. Irad Ben-Gal and Prof. Shulamith Kreitler, tested a new method for identifying websites, based on analyzing their graphic design with data mining methods. To this end, an algorithm was constructed that receives a URL as input, retrieves and processes the page's graphic design features, and stores them as tables on a server. The researchers examined hundreds of home pages of websites taken from Google's top 1000 sites. For each of these, over 1000 graphic design features were examined, such as the size of the area covered by each color, font size, number of characters, standard deviations, and the quantity and type of elements.
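To make the idea of design features more concrete, here is a minimal sketch in Python of how a few such features might be pulled out of a page's markup. It is an illustration only, not the researchers' algorithm: the class name `DesignFeatureParser`, the function `extract_features`, the feature names and the sample HTML are all hypothetical, and the sketch reads only inline styles from an HTML string rather than rendering the page, so it cannot measure properties such as the area covered by each color.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev


class DesignFeatureParser(HTMLParser):
    """Collects a few simple design signals from HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()  # quantity of each element type
        self.colors = Counter()      # colour values declared in inline styles
        self.font_sizes = []         # inline font sizes, in px
        self.char_count = 0          # rough count of text characters

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        style = dict(attrs).get("style") or ""
        # Matches "color: ..." and "background-color: ..." declarations.
        for value in re.findall(r"(?:background-)?color\s*:\s*([^;]+)", style):
            self.colors[value.strip().lower()] += 1
        for size in re.findall(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", style):
            self.font_sizes.append(float(size))

    def handle_data(self, data):
        self.char_count += len(data.strip())


def extract_features(html):
    """Return one row of illustrative design features for a page."""
    parser = DesignFeatureParser()
    parser.feed(html)
    sizes = parser.font_sizes
    return {
        "n_elements": sum(parser.tag_counts.values()),
        "n_element_types": len(parser.tag_counts),
        "n_images": parser.tag_counts["img"],
        "n_distinct_colors": len(parser.colors),
        "mean_font_size": sum(sizes) / len(sizes) if sizes else 0.0,
        "std_font_size": pstdev(sizes) if len(sizes) > 1 else 0.0,
        "n_characters": parser.char_count,
    }


# Made-up sample page, used only to exercise the sketch.
SAMPLE = (
    '<html><body style="background-color: #ffffff">'
    '<h1 style="color: red; font-size: 32px">Sale</h1>'
    '<p style="color: #333333; font-size: 16px">Hello world</p>'
    '<img src="a.png"></body></html>'
)
features = extract_features(SAMPLE)
print(features)
```

Each call yields one row of numbers per page, which is the kind of table a classifier can then be trained on. A fuller pipeline would presumably fetch the page from its URL and render it before measuring anything visual.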
After processing and analyzing the data, the researchers built a predictive model based on a decision tree, and then cross-validated the model. In the first experiment they found that classification based solely on graphic design predicts all five examined website categories (including malicious websites) with relatively high accuracy. A second experiment found that adding graphic design features to another objective prediction method can improve the precision of identification, specifically of malicious websites, by 95%, in a statistically significant manner, using low-cost resources and short runtimes. A possible explanation for these findings is that malicious websites apparently try to conceal keywords to avoid detection by search engines, while a search based on so many graphic features can identify repeated patterns that are more difficult to conceal.
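The two steps named above, fitting a decision tree and cross-validating it, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the study's model: the tree uses Gini impurity with midpoint thresholds, the validation is a simple k-fold split, and the two features (a count of distinct colors and a mean font size) and their values are synthetic numbers invented so the example runs. All function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        vals = sorted({r[f] for r in rows})
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree; leaves are class labels, inner nodes are tuples."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority class
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def cross_validate(rows, labels, k=5, seed=0):
    """Return the held-out accuracy of each of k folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tree = build_tree([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(tree, rows[i]) == labels[i] for i in fold)
        scores.append(hits / len(fold))
    return scores


# Synthetic toy data (NOT from the study): (distinct colors, mean font size).
rows = [(i % 5 + 2, 12 + i % 4) for i in range(20)]
rows += [(i % 5 + 15, 24 + i % 4) for i in range(20)]
labels = ["benign"] * 20 + ["malicious"] * 20

scores = cross_validate(rows, labels, k=5)
print(scores)
```

Because the two synthetic classes are cleanly separable by construction, every fold here is classified perfectly. Real pages would overlap far more, and the per-fold scores would show how well the tree generalizes to sites it has not seen.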
The study revealed that graphic design, especially colors, plays an important part in the prediction of website categories. Adding graphic properties to other predictive systems can improve their accuracy, and is therefore highly recommended. The study will be presented at the national cyber conference scheduled to take place in September at Tel Aviv University.