dc.description.abstract |
Mining web access log data is a popular technique for identifying frequent access patterns of website users. When properly analyzed, web logs can provide a wealth of information on the user access patterns of the corresponding website. However, finding interesting patterns hidden in the low-level log data is non-trivial due to large log volumes and the distribution of the log files across cluster environments.
Existing clustering techniques have not focused on identifying infrequent patterns, and most of them suffer from cluster parameter issues when applied to web usage mining. This thesis presents the iterative application of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Expectation Maximization (EM) algorithms for clustering, an approach that has not been used before. Each cluster corresponds to one or more web user activities. For clusters that did not have a unique access pattern, frequent pattern mining and sequence pattern mining techniques were used to identify the unique user access patterns.
Second, this thesis addresses another issue in web usage mining: detecting slight changes between web user access sessions. It introduces a method to identify these access patterns at a much lower level than that provided by traditional techniques, such as nearest-neighbor-based techniques and classification techniques. The method uses the concept of episodes to represent web sessions. These episodes are expressed in the form of regular expressions. To the best of our knowledge, this is the first time that regular expressions have been applied to identify user access patterns in web server log data.
We demonstrate that the implemented system is capable of identifying not only common user behaviors but also anomalous user behavior. |
en_US |