[Dshield] Statistical analysis of logfiles--update

Pete Cap peteoutside at yahoo.com
Tue Jan 6 21:00:58 GMT 2004

Hello all,
A while back I posted about applying statistical analysis techniques to the examination of logfiles.  Here's an update on my progress.
I think that while the basic idea remains sound, particularly for comparing differences among many disparate networks, its initial application to the dShield data is limited because of how the data are presented; I can conduct the tests but I do not know how valid the results are.  The most robust test to date seems to be calculating Confidence Intervals.
With regard to the recent port 23 surges, the 99% Confidence Interval Method shows significant spikes as follows:
Sources - As noted, almost entirely steady, with only one significant spike in the past 30 days (December 23, with 407).
Targets - Elevated traffic since 28 December, with truly significant spikes on the 27th, 28th, and 30th of December and on 4-6 January.
Records - No significant spikes in the past 30 days.
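For anyone who wants to try the confidence-interval check on their own numbers, here is a minimal Python sketch.  It is an assumption about what such a check amounts to, not the exact procedure behind the figures above: each day's count is compared with the mean and sample standard deviation of the other days in the window, and it is flagged if it exceeds the upper 99% bound under a normal approximation.  The function name find_spikes and the sample counts are made up for illustration; they are not dShield data.

```python
from statistics import NormalDist, mean, stdev


def find_spikes(counts, confidence=0.99):
    """Return (index, count) pairs for days above the upper bound.

    Assumption: daily counts are roughly normal around a steady baseline.
    Each day is judged against the OTHER days in the window, so a single
    large spike does not inflate its own baseline.
    """
    if len(counts) < 3:
        raise ValueError("need at least 3 daily counts")

    # Two-sided interval: 99% confidence puts 0.5% in each tail (z ~ 2.576).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    spikes = []
    for i, value in enumerate(counts):
        others = counts[:i] + counts[i + 1:]
        upper = mean(others) + z * stdev(others)
        if value > upper:
            spikes.append((i, value))
    return spikes


# Hypothetical daily source counts, invented for this example.
daily_counts = [100, 104, 98, 101, 97, 103, 99, 250, 102, 100]
print(find_spikes(daily_counts))  # [(7, 250)]
```

Leaving the tested day out of its own baseline is a design choice; with a short window, one large spike can otherwise widen the interval enough to hide itself.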
I still have much hope for forecasting methods (e.g. Holt-Winters--thanks, Swa).  If you have a good "handle" on the behavior of your network (i.e., how accurately you have set your model parameters), then you ought to be able to predict its activity and compare the prediction with the observed value.  Determining how different they must be in order to raise red flags brings us back to the problems with the statistical hypothesis tests.  I do not like guessing at the model parameters.  I am still trying to figure out how to determine them from traffic alone (i.e., find a "best fit" given your past 180 days of traffic or so).
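One common way to get a "best fit" from past traffic alone is to pick the smoothing parameters that minimize the one-step-ahead forecast error over the history.  The sketch below does this with a plain grid search.  To keep it short it uses Holt's linear (level plus trend) method, which is Holt-Winters without the seasonal term, so it is a simplification rather than the full model.  The function names, the grid spacing, and the initialization from the first two observations are all assumptions made for illustration.

```python
from itertools import product


def holt_forecast_error(series, alpha, beta):
    """Run Holt's linear method over series.

    Returns (sse, next_forecast): the sum of squared one-step-ahead
    errors, and the forecast for the day after the series ends.
    """
    # Assumed initialization: level and trend from the first two points.
    level = series[1]
    trend = series[1] - series[0]
    sse = 0.0
    for observed in series[2:]:
        forecast = level + trend          # prediction made before seeing the day
        sse += (observed - forecast) ** 2
        new_level = alpha * observed + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return sse, level + trend


def fit_holt(series):
    """Grid-search alpha and beta in 0.05 steps; return the best fit found."""
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best = None
    for alpha, beta in product(grid, grid):
        sse, next_forecast = holt_forecast_error(series, alpha, beta)
        if best is None or sse < best[2]:
            best = (alpha, beta, sse, next_forecast)
    return best


# Hypothetical history: in practice this would be ~180 daily counts.
history = [120, 118, 125, 130, 128, 135, 140, 138, 145, 150]
alpha, beta, sse, tomorrow = fit_holt(history)
print(alpha, beta, round(tomorrow, 1))
```

The fitted forecast can then be compared with the observed count for the next day.  How large a gap should raise a red flag is the open question noted above, and this sketch does not try to answer it.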
It occurs to me that what I am trying to implement here in a very crude fashion is an anomaly-detection scheme, which I have been approaching as a pure traffic analysis problem.  If any of you have this in your background, please feel free to lend a hand (I am doing this with straight math--I have no experience with Snort as of yet).

