11.07.2015 Views

Nagios-Surfer - A Quality Reporting Tool for Nagios


Funet
The Finnish NREN
80 connected organizations
375000 end users
Operated by CSC – IT Center for Science Ltd


Introduction
Funet uses Nagios extensively for monitoring.
– network
– servers
– services
Two Nagios monitoring servers
– Over 900 monitored hosts
– Over 10000 monitored services
Nagios suits us well.


Challenges: Huge number of services
Possibility for configuration errors
We needed an administrative overview of the used monitoring configuration.
– What is monitored and how?
– Who receives notifications and when?
– What differences are there?


Challenges: Nagios availability reporting is slow
A large number of services means that the Nagios event log files become large
– 1.5 GB of event log per month
– 18 GB of event log per year
Nagios avail.cgi reads through event logs each time an availability report is requested.
– Generating an availability report may take several minutes.


Challenges: Finding the biggest availability issues
How do we find the biggest issues from 10000 services?
– How to take different service availability requirements into account?
If service availability is low, what has caused it?
How to inform service administrators about availability issues?
– Short breaks may not have triggered Nagios e-mail notifications.


Introducing Nagios-Surfer
Developed at Funet to help with the aforementioned issues in Nagios reporting.
Provides:
– Administrative reports about Nagios configuration
– Availability reports
– Event logs
– Tools for categorizing service breaks and informing administrators about unexplained downtime.


Overview


Nagios-Surfer features: Nagios configuration overview
Automatically generated for all hosts, services, contacts, and groups.
Reports information about
– Service checks
– Notifications
Reports differences between the monitoring configuration of hosts or services in the same group.
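
The difference report can be illustrated with a minimal Python sketch. It is not Nagios-Surfer's own code (the tool is written in Perl), and the data layout is an assumption: each group member is represented as a plain dictionary of Nagios directives, and `config_differences` is a hypothetical helper name.

```python
def config_differences(group):
    """Return the directives whose values differ between group members.

    `group` maps an object name (host or service) to a dict of
    directive -> value strings. This layout is an assumption made
    for illustration only.
    """
    if not group:
        return {}
    # Collect every directive that appears on at least one member.
    directives = set().union(*(cfg.keys() for cfg in group.values()))
    differences = {}
    for directive in sorted(directives):
        # A missing directive is reported as None for that member.
        values = {name: cfg.get(directive) for name, cfg in group.items()}
        if len(set(values.values())) > 1:
            differences[directive] = values
    return differences


example_group = {
    "web1": {"check_interval": "5", "contact_groups": "admins"},
    "web2": {"check_interval": "10", "contact_groups": "admins"},
    "web3": {
        "check_interval": "5",
        "contact_groups": "admins",
        "notification_period": "workhours",
    },
}
example_diffs = config_differences(example_group)
```

Here `example_diffs` lists `check_interval` and `notification_period` but not `contact_groups`, since all three hosts agree on the latter.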


Nagios-Surfer features: Availability reports
Automatically generates availability reports of all hosts, services, contacts and groups.
Availability reports are pregenerated.
– No need to wait.
Availability numbers are reported per month.
– Availability excluding scheduled downtime is also reported.
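
A rough sketch of how the two monthly figures could be computed, written in Python for illustration rather than taken from Nagios-Surfer. It rests on several assumptions: breaks are given as `(start, end, scheduled)` tuples with Unix timestamps, breaks do not overlap each other, and "excluding scheduled downtime" means that scheduled breaks are simply not counted as downtime. The slides do not specify the exact formula.

```python
from datetime import datetime, timezone


def month_bounds(year, month):
    """Return (start, end) Unix timestamps of a calendar month in UTC."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # December rolls over to January of the next year.
    end = datetime(year + (month == 12), month % 12 + 1, 1, tzinfo=timezone.utc)
    return start.timestamp(), end.timestamp()


def monthly_availability(breaks, year, month):
    """Return (availability, availability excluding scheduled downtime) in percent.

    Assumption: breaks do not overlap each other, so their durations
    can be summed directly.
    """
    start, end = month_bounds(year, month)
    total = end - start
    down = 0.0
    unscheduled_down = 0.0
    for b_start, b_end, scheduled in breaks:
        # Only the part of the break inside the month is counted.
        overlap = max(0.0, min(b_end, end) - max(b_start, start))
        down += overlap
        if not scheduled:
            unscheduled_down += overlap
    return (
        100.0 * (total - down) / total,
        100.0 * (total - unscheduled_down) / total,
    )
```

Pregenerating these numbers once per month, instead of recomputing them from the event log on every request, is what removes the waiting time.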


Nagios-Surfer features: Event logs
Nagios-Surfer generates monthly event log summaries of all hosts and services.
– Redundant information, such as duplicate and subsequent OK lines, is removed.
Each break contains a link to a form which can be used to examine and modify the categorization and description of the break.
Event logs can be accessed easily through the availability reports.
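
The removal of redundant lines can be sketched in Python as follows. This is an illustration, not the tool's Perl implementation. It assumes the standard Nagios log line layout `[timestamp] SERVICE ALERT: host;service;state;state type;attempt;output`, handles only service alerts, and interprets "redundant" as any line that repeats the previous state of the same host and service.

```python
import re

# Standard Nagios service alert line; host alerts are ignored in this sketch.
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);(.*)$"
)


def summarize_events(lines):
    """Keep only the lines where a service changes state.

    Duplicate alerts and repeated OK lines carry no new information,
    so they are dropped. Returns (timestamp, host, service, state) tuples.
    """
    last_state = {}
    summary = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # not a service alert line
        timestamp, host, service, state = match.group(1, 2, 3, 4)
        key = (host, service)
        if last_state.get(key) == state:
            continue  # same state as before: redundant
        last_state[key] = state
        summary.append((int(timestamp), host, service, state))
    return summary


sample_log = [
    "[1433116800] SERVICE ALERT: web1;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1433116860] SERVICE ALERT: web1;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1433117100] SERVICE ALERT: web1;HTTP;OK;HARD;1;HTTP OK",
    "[1433117400] SERVICE ALERT: web1;HTTP;OK;HARD;1;HTTP OK",
]
sample_summary = summarize_events(sample_log)
```

In `sample_summary` only the first CRITICAL line and the first OK line remain, which together delimit one break.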


Nagios-Surfer features: Break categorization
Nagios-Surfer can send break clarification requests to administrators by e-mail.
Administrators can categorize and describe breaks. The information is saved to the Nagios-Surfer database for later use.
If a break is categorized as scheduled downtime, the change will be reflected in the availability reports.
– If a break happens during Nagios scheduled downtime, the break is automatically categorized as scheduled downtime.
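
The automatic categorization rule might look like the Python sketch below. The `Break` record, the category string, and the decision to require that a break lies entirely inside a downtime window are all assumptions for illustration; the slides do not say how partially overlapping breaks are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Break:
    """Hypothetical break record; field names are illustrative."""
    start: int
    end: int
    category: Optional[str] = None
    description: str = ""


def auto_categorize(breaks, downtimes):
    """Mark uncategorized breaks that fall inside a scheduled downtime window.

    `downtimes` is a list of (start, end) timestamp pairs. Breaks that an
    administrator has already categorized are left untouched.
    """
    for brk in breaks:
        if brk.category is not None:
            continue
        inside = any(
            d_start <= brk.start and brk.end <= d_end
            for d_start, d_end in downtimes
        )
        if inside:
            brk.category = "scheduled downtime"
    return breaks
```

Breaks that stay uncategorized after this step are the ones for which a clarification request would be sent to the administrators.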


Example: Configuration overview


Example: Availability report


Example: Event log


Example: Break information


A tool for quality assurance
We have internal quality assurance processes that oversee that services meet the set reliability requirements.
Nagios-Surfer allows a quality assurance team to use better data.
– Service administrators investigate new service breaks and save the information to Nagios-Surfer.
– When the causes are known, a quality assurance process can concentrate on the most relevant issues.


Archiving information about breaks
Nagios-Surfer provides a central place for information about service disruptions.
– Nagios-Surfer does not forget what happened. Makes it easier to notice patterns.
– Investigating old issues becomes easier too, as the breaks of possible service dependencies are visible.
– Works also across organizational boundaries. There is no need to send e-mail and wait for an answer.


Providing availability reports to end user organizations
An organization connected to Funet will be able to see the availability history of all used services at a glance.
– IP connections
– Light paths
– … and more?
Availability data is provided by Nagios-Surfer.
Work in progress


Other uses for Nagios-Surfer components
The Nagios configuration API of Nagios-Surfer is used in other tools developed at Funet.
– Scheduling Nagios downtime according to predefined templates.
  Server X is rebooted – affects also services Y and Z.
  The breaks in all affected services are documented automatically in Nagios-Surfer.
– Combining several Nagios service groups into one large service group.
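
As an illustration of template-based downtime scheduling, the Python sketch below expands one template into a downtime entry for every affected host and service. The template format, the names `DOWNTIME_TEMPLATES` and `expand_template`, and the host names are hypothetical; the slides do not describe the real template structure or the configuration API.

```python
# Hypothetical template store; the real Funet templates are not documented here.
DOWNTIME_TEMPLATES = {
    "reboot-server-x": {
        "comment": "Server X is rebooted",
        "targets": [
            ("server-x", None),         # None stands for the host itself
            ("server-y", "Service Y"),  # services that depend on Server X
            ("server-z", "Service Z"),
        ],
    },
}


def expand_template(name, start, duration, templates=DOWNTIME_TEMPLATES):
    """Return one downtime entry per host or service affected by the template."""
    template = templates[name]
    return [
        {
            "host": host,
            "service": service,
            "start": start,
            "end": start + duration,
            "comment": template["comment"],
        }
        for host, service in template["targets"]
    ]
```

Because every entry carries the same comment, the resulting breaks in all affected services can be documented with one description.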


Technical details
Object-oriented Perl
– ~15k lines, ~360kB of code
SQLite
– For saving information about breaks and scheduled downtime
– For caching Nagios configuration data
Reports use XHTML, CSS and some JavaScript.
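
To make the role of SQLite concrete, here is a small Python sketch of a break store using the standard `sqlite3` module. The table and column names are invented for illustration and do not reflect the actual Nagios-Surfer schema, which the slides do not show.

```python
import sqlite3

# Hypothetical schema; the real Nagios-Surfer tables are not documented here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS breaks (
    id          INTEGER PRIMARY KEY,
    host        TEXT NOT NULL,
    service     TEXT,
    start_time  INTEGER NOT NULL,
    end_time    INTEGER,
    category    TEXT,
    description TEXT
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_break(conn, host, service, start_time, end_time,
                 category=None, description=None):
    """Store one break and return its row id."""
    cursor = conn.execute(
        "INSERT INTO breaks (host, service, start_time, end_time, category, description) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (host, service, start_time, end_time, category, description),
    )
    conn.commit()
    return cursor.lastrowid


def uncategorized_breaks(conn):
    """Return (id, host, service) of breaks that still need an explanation."""
    return conn.execute(
        "SELECT id, host, service FROM breaks "
        "WHERE category IS NULL ORDER BY start_time"
    ).fetchall()
```

A single-file database like this keeps break descriptions available long after the original Nagios event logs have been rotated away.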


Questions?
For more information: teemu.kiviniemi@csc.fi
