Abstract:
Mining web access log data is a popular technique
to identify frequent access patterns of website users. There are
many mining techniques such as clustering, sequential pattern
mining and association rule mining to identify these frequent
access patterns. Each can find interesting access patterns and
group users, but none can identify the slight differences
between access patterns within an individual cluster. In
practice, these differences can carry important information about attacks.
This paper introduces a methodology to identify these access
patterns at a much finer granularity than that provided by traditional
clustering techniques, such as nearest-neighbour-based techniques
and classification techniques. This technique makes use
of the concept of episodes to represent web sessions. These
episodes are expressed in the form of regular expressions. To the
best of our knowledge, this is the first application of
regular expressions to identifying user access patterns in web
server log data. In addition to identifying frequent patterns, we
demonstrate that this technique is able to identify access patterns
that occur rarely, which traditional clustering mechanisms would
simply treat as noise.
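
To make the idea of episodes as regular expressions concrete, the following is a minimal sketch and not the method of this paper, whose details the abstract does not give. It assumes that each requested page maps to a one-letter symbol, that a session is the ordered string of those symbols, and that an episode is formed by collapsing each run of a repeated symbol into `X+`. The names `PAGE_SYMBOLS`, `SESSIONS`, `to_episode`, `episode_counts`, `rare_episodes` and `matches`, as well as the sample paths, are hypothetical.

```python
import re
from collections import Counter
from itertools import groupby

# Hypothetical mapping from requested paths to one-letter page symbols.
PAGE_SYMBOLS = {"/index": "A", "/login": "B", "/search": "C", "/admin": "D"}

# Made-up sessions, each an ordered list of requested paths.
SESSIONS = [
    ["/index", "/login", "/search", "/search"],
    ["/index", "/login", "/search", "/search", "/search"],
    ["/index", "/login", "/login", "/login", "/admin"],
]


def encode_session(paths):
    """Turn an ordered list of paths into a string of page symbols."""
    return "".join(PAGE_SYMBOLS[p] for p in paths)


def to_episode(symbols):
    """Collapse each run of a repeated symbol into 'X+' (assumed episode form)."""
    parts = []
    for sym, run in groupby(symbols):
        repeated = len(list(run)) > 1
        parts.append(re.escape(sym) + ("+" if repeated else ""))
    return "".join(parts)


def episode_counts(sessions):
    """Count how many sessions fall under each episode pattern."""
    return Counter(to_episode(encode_session(s)) for s in sessions)


def rare_episodes(counts, max_count=1):
    """Return the episodes seen at most max_count times, in sorted order."""
    return sorted(ep for ep, c in counts.items() if c <= max_count)


def matches(episode, symbols):
    """Check whether a symbol string is an instance of an episode pattern."""
    return re.fullmatch(episode, symbols) is not None


if __name__ == "__main__":
    counts = episode_counts(SESSIONS)
    print(counts.most_common())   # frequent episodes first
    print(rare_episodes(counts))  # episodes a clustering step might drop as noise
```

In this sketch the first two sessions share one episode even though their lengths differ, while the repeated-login session forms its own rarely occurring episode and is reported rather than discarded.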