Anomaly detection in application logs using machine learning
Company: Skyresponse AB
Skyresponse AB develops and operates a cloud service for receiving and handling alarms and events. At the time of writing, the system continuously receives and handles about 10 million alarms per month.
The system is hosted entirely in the Amazon cloud and is built using modern cloud service techniques such as microservices, serverless functions, API gateways, and cloud-hosted server instances.
The company’s headquarters is in Stockholm, while the development team has offices in Luleå and Ukraine.
Master's degree project
System monitoring is an important part of operating the services, and Skyresponse has a substantial number of system alarms for detecting problems such as high CPU load. Finding the root cause of a problem is often more difficult. Skyresponse’s core applications output between 1000 and 1500 log lines per second. Manually finding the relevant logs that could explain a detected problem, when it is not known beforehand what to search for, is time-consuming.
Log inspection is also important in relation to releases of new application versions. Skyresponse works with continuous delivery and releases updated versions of the applications every fourth week. These releases add, remove, and alter the logs printed by the applications, which makes it harder to detect unwanted changes. After an application is released, the logs are inspected and searched for errors to detect potential problems with the release, but it can be difficult to manually identify unwanted changes at an early stage.
The scope of work for the degree project is to:
- Investigate and propose a technique to detect anomalies in the logs printed by the applications.
  - What preprocessing of the log data is needed?
  - How should new, deleted, and changed logs be handled upon releases?
  - Can logs printed in the test environment be used for pre-learning?
  - What anomaly detection models can be used?
- Test the proposed technique on old logs.
  - Skyresponse has logs from several years back in time, stored in the cloud.
  - Use Amazon cloud services, i.e. SageMaker, for preprocessing, training, and evaluation.
- Write a report on the findings.
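
To make the preprocessing and release-handling questions above more concrete, the following is a minimal sketch of one possible starting point, not a method prescribed by Skyresponse. It assumes that log lines are plain text, that variable fields (numbers, UUIDs) can be masked with regular expressions to obtain log templates, and that two windows of equal duration (for example before and after a release) are compared. All names (`to_template`, `compare_windows`, the masking rules, the `ratio` threshold) are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules (an assumption, not taken from the real logs):
# variable fields are replaced by placeholders so that lines differing only
# in IDs or numbers map to the same template. UUIDs are masked first so the
# number rule does not split them.
_MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Mask variable fields in one log line and return its template."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def template_counts(lines):
    """Count how often each template occurs in a window of log lines."""
    return Counter(to_template(line) for line in lines)


def compare_windows(baseline, current, ratio=3.0):
    """Compare two windows of log lines (assumed to cover equal durations).

    Returns templates that are new, templates that disappeared, and
    templates whose count grew by more than `ratio` times the baseline.
    """
    base = template_counts(baseline)
    cur = template_counts(current)
    return {
        "new": sorted(t for t in cur if t not in base),
        "missing": sorted(t for t in base if t not in cur),
        "spiking": sorted(
            t for t in cur if t in base and cur[t] > ratio * base[t]
        ),
    }


if __name__ == "__main__":
    # Invented example lines, for illustration only.
    before = ["Alarm 123 received from device 9", "Heartbeat ok"]
    after = ["Alarm 7 received from device 1",
             "Timeout after 30 ms calling service"]
    print(compare_windows(before, after))
```

A comparison like this only flags changes in which templates occur and how often. Whether it is sufficient, or whether a learned anomaly detection model is needed, is part of what the project is meant to investigate.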