We had situations where the manager would start more than one worker on the
same job, either because of race conditions or because at the time of queueing
it wasn't known that the jobs were targeting the same device (due to device
aliases).

This commit removes duplicate jobs, reduces the need for locking on the job
queue, and makes use of lldpRemChassisId to try to deduplicate jobs before
they are started. In effect we have several chances to prevent duplicate jobs:
1. at neighbor discovery time we try to skip queueing neighbors with the same
   lldpRemChassisId
2. at job selection we 'error out' jobs with the same profile as the job selected
3. at job selection we check for a running job with the same profile as the
   job selected
4. the job manager process also checks for duplicate job profiles
5. at job lock we abort if the job was 'errored out'
All together this seems to work well. A test on a large university network of
303 devices (four core routers and the rest edge routers, running VRF with many
duplicate identities), ~1200 subnets, ~50k hosts, resulted in no DB deadlock
or contention and a complete discover+arpnip+macsuck (909 jobs) in ~3 minutes
(with ~150 duplicate jobs identified and skipped).
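
The checks listed above can be illustrated with a minimal Python sketch. This
is not the Netdisco code (which is Perl and works against the database); the
Job class, the status values, the idea that a "profile" is the (device, action)
pair, and the function names are all assumptions made for illustration. It
covers step 1 (skip neighbors with an already seen lldpRemChassisId), steps 2
and 3 (error out or pass over duplicates at job selection) and step 5 (abort
the lock on an errored-out job).

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: int
    device: str
    action: str
    status: str = "queued"  # assumed values: queued | running | error

    @property
    def profile(self):
        # Assumption: two jobs are duplicates if device and action match.
        return (self.device, self.action)


def queue_neighbors(neighbors, seen_chassis_ids):
    """Step 1: return neighbor IPs to queue, skipping repeated chassis IDs."""
    to_queue = []
    for ip, chassis_id in neighbors:
        if chassis_id is not None:
            if chassis_id in seen_chassis_ids:
                continue  # same lldpRemChassisId already queued
            seen_chassis_ids.add(chassis_id)
        to_queue.append(ip)
    return to_queue


def select_job(queue):
    """Steps 2 and 3: pick a queued job and error out its duplicates."""
    running = {j.profile for j in queue if j.status == "running"}
    for job in queue:
        if job.status != "queued":
            continue
        if job.profile in running:
            # Step 3 (assumed handling): a job with this profile is running.
            job.status = "error"
            continue
        for other in queue:
            if (other is not job and other.status == "queued"
                    and other.profile == job.profile):
                other.status = "error"  # step 2
        return job
    return None


def lock_job(job):
    """Step 5: refuse to lock a job that was errored out in the meantime."""
    if job.status == "error":
        return False
    job.status = "running"
    return True
```

With a queue holding two 'discover' jobs for the same device and one for
another device, select_job() returns the first, marks the second as 'error',
and a later lock_job() on the second job returns False.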
This commit adds a table 'device_skip' that is used to restrict job queue
searches so that they avoid jobs not permitted on this backend via *_no ACLs,
and jobs on devices that have previously encountered multiple SNMP timeouts.
When the backend loads or a device is added, a row is added to the table if
that device should not be polled on this backend (together with the job
actions which are to be skipped/denied). When a device SNMP connect fails a
counter in the same row (or a new row) is incremented.
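
As a rough illustration of the row handling described above, here is a small
in-memory Python sketch. The 'actionset' and 'deferrals' names follow the
commit subjects in the squashed log; the class names, the method names and the
dictionary keyed by device are assumptions, and the real table lives in the
database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class SkipRow:
    actionset: set = field(default_factory=set)  # actions denied for the device
    deferrals: int = 0                           # SNMP connect failure counter


class DeviceSkip:
    """Hypothetical stand-in for the device_skip rows of one backend."""

    def __init__(self, backend):
        self.backend = backend
        self.rows = {}  # device -> SkipRow

    def deny(self, device, actions):
        # Called when the backend loads or a device is added and the
        # *_no ACLs say these actions must not run on this backend.
        self.rows.setdefault(device, SkipRow()).actionset.update(actions)

    def record_connect_failure(self, device):
        # Increment the counter in the existing row, or in a new row.
        row = self.rows.setdefault(device, SkipRow())
        row.deferrals += 1
        return row.deferrals

    def is_denied(self, device, action):
        row = self.rows.get(device)
        return row is not None and action in row.actionset

    def delete_device(self, device):
        # Rows are dropped together with the device.
        self.rows.pop(device, None)
```

For example, after deny("10.0.0.1", {"macsuck"}) a lookup with
is_denied("10.0.0.1", "macsuck") is true, while a 'discover' job for the same
device is still allowed.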
There is also a new report 'SNMP Connect Failures' to show the devices with
non-zero SNMP connect failure counters. The configurable setting
'max_deferrals' sets the number of failures after which the device is no
longer polled. To reset the deferrals/failures count, restart the Netdisco
backend (which regenerates 'device_skip' cache entries).
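
The threshold and the report can be sketched as follows. This is only an
illustration in Python: the function names are made up, the counters are held
in a plain dictionary, and whether the real comparison against
'max_deferrals' is strict or inclusive is an assumption here.

```python
def pollable_devices(deferrals, max_deferrals):
    """Devices still below the threshold (assumed: counter < max_deferrals)."""
    return [dev for dev, count in deferrals.items() if count < max_deferrals]


def connect_failure_report(deferrals):
    """Rows for an 'SNMP Connect Failures' style report: non-zero counters,
    highest count first, ties ordered by device name."""
    rows = [(dev, count) for dev, count in deferrals.items() if count > 0]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def restart_backend(deferrals):
    """A backend restart regenerates the entries, so counters start at zero."""
    return {dev: 0 for dev in deferrals}
```

With counters {"a": 0, "b": 3, "c": 10} and a limit of 10, devices "a" and "b"
are still polled, the report lists "c" then "b", and after a restart all three
are polled again.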
Squashed commit of the following:
commit b5e32c219d
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 20:55:14 2017 +0100
show all failed connections in report
commit ffce3cee84
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 20:12:39 2017 +0100
only resolve fqdn once
commit cc4f680f01
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 20:10:20 2017 +0100
Revert "only resolve fqdn once"
This reverts commit 3d136a54de.
commit d8d082b30e
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 20:09:05 2017 +0100
a report to show SNMP failures
commit 3d136a54de
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 19:37:58 2017 +0100
only resolve fqdn once
commit 4550b8a84c
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 17:27:43 2017 +0100
skipover now implicit from deferrals/actionset; fix sql where logic with better correlation
commit b51edbccd2
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 16:11:29 2017 +0100
only abort lock if action matches badactions
commit 415559b24f
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 13:56:42 2017 +0100
set skipover true when adding to actionset
commit 1086f2c467
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 13:50:56 2017 +0100
fix empty actionset
commit 31962580b8
Merge: 9b2e993e 6808133b
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 13:25:08 2017 +0100
Merge branch 'og-device_skip' of github.com:netdisco/netdisco into og-device_skip
commit 6808133bdb
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 13:19:54 2017 +0100
in-job checks for acls are required for netdisco-do foreground actions
commit 3944dd7813
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 13:18:30 2017 +0100
avoid extra device lookup
commit 9b2e993e0f
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 12:31:36 2017 +0100
also delete device_skip rows when deleting device
commit b55854e91d
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 11:34:27 2017 +0100
actions in device_skip table are now an array/set
commit 5e126eef07
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 09:36:33 2017 +0100
typo
commit 44266f2767
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 09:14:25 2017 +0100
*able checks within jobs should not be necessary with skiplist
commit e7c22e7d11
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 08:58:57 2017 +0100
increment deferrals field when job is deferred
commit 88ae9c00ba
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 08:40:27 2017 +0100
turn connect fail into defer
commit eac1857043
Author: Oliver Gorwits <oliver@cpan.org>
Date: Tue May 23 08:26:59 2017 +0100
rename failures column to be deferrals
commit 96ed444bbb
Author: Oliver Gorwits <oliver@cpan.org>
Date: Mon May 22 22:52:51 2017 +0100
set up list of jobs the backend instance should skip
commit 3a0019296d
Author: Oliver Gorwits <oliver@cpan.org>
Date: Mon May 22 22:01:50 2017 +0100
separate out is_*able last_* checks
commit cf8589aba2
Author: Oliver Gorwits <oliver@cpan.org>
Date: Sun May 21 22:35:38 2017 +0100
change from ignore to skip name
commit ed193356f8
Author: Oliver Gorwits <oliver@cpan.org>
Date: Sun May 21 14:52:33 2017 +0100
device_ignore table to track devices to skip in polling