Commit Graph

5 Commits

Author SHA1 Message Date
Oliver Gorwits
0bb15f36b9 fixes for race conditions and dupes in job queue
we had situations where the manager would start workers on the same job,
either because of race conditions or because at the time of queueing it wasn't
known that the jobs were targeting the same device (due to device aliases).

this commit removes duplicate jobs, reduces the need for locking on the job
queue, and makes use of lldpRemChassisId to try to deduplicate jobs before
they are started. in effect we have several goes to prevent duplicate jobs:

1. at neighbor discovery time we try to skip queueing same lldpRemChassisId
2. at job selection we 'error out' jobs with same profile as job selected
3. at job selection we check for running job with same profile as selected
4. the job manager process also checks for duplicate job profiles
5. at job lock we abort if the job was 'errored out'
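
As a rough illustration of layers 1, 2, 3 and 5 above, here is a minimal Python sketch. It is not the Netdisco implementation: the `Job` class, the status strings, and the idea that a job "profile" is the (action, device) pair are all assumptions made for the example, as is the choice to error out a queued job when a job with the same profile is already running.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Job:
    # Hypothetical job record; field names are illustrative only.
    job_id: int
    action: str
    device: str
    status: str = "queued"  # queued | running | error


def profile(job: Job) -> Tuple[str, str]:
    # Assumption: two jobs are duplicates when action and device match.
    return (job.action, job.device)


def enqueue_neighbor(queue: List[Job], seen_chassis: Set[str],
                     job: Job, chassis_id: Optional[str]) -> bool:
    # Layer 1: skip queueing when this lldpRemChassisId was already seen.
    if chassis_id is not None:
        if chassis_id in seen_chassis:
            return False
        seen_chassis.add(chassis_id)
    queue.append(job)
    return True


def select_job(queue: List[Job]) -> Optional[Job]:
    for job in queue:
        if job.status != "queued":
            continue
        # Layer 3: a job with the same profile is already running.
        if any(o.status == "running" and profile(o) == profile(job)
               for o in queue):
            job.status = "error"
            continue
        # Layer 2: 'error out' other queued jobs with the same profile.
        for other in queue:
            if (other is not job and other.status == "queued"
                    and profile(other) == profile(job)):
                other.status = "error"
        return job
    return None


def lock_job(job: Job) -> bool:
    # Layer 5: abort the lock if the job was errored out in the meantime.
    if job.status != "queued":
        return False
    job.status = "running"
    return True
```

In this sketch a worker would call `select_job` and then `lock_job`; a `False` from `lock_job` means another selection already marked the job as a duplicate.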

all together this seems to work well. a test on a large university network of
303 devices (four core routers and the rest edge routers, running VRF with many
duplicate identities), ~1200 subnets, ~50k hosts, resulted in no DB deadlock
or contention and a complete discover+arpnip+macsuck (909 jobs) in ~3 minutes
(with ~150 duplicate jobs identified and skipped).
2017-11-23 19:55:34 +00:00
Oliver Gorwits
65ba93acc0 #342 Job Queue Add Hostname or FQDN 2017-09-21 20:05:38 +01:00
Oliver Gorwits
ef0d326287 retry jobs once per week 2017-06-12 22:31:20 +01:00
Oliver Gorwits
9a72d7e74a Avoid lock/defer of jobs denied via ACL
This commit adds a table 'device_skip' that is used to restrict job queue
searches to avoid jobs that are not permitted on this backend via *_no ACLs,
or jobs on devices that have previously encountered multiple SNMP timeouts.

When the backend loads or a device is added, a row is added to the table if
that device should not be polled on this backend (together with the job
actions which are to be skipped/denied). When a device SNMP connect fails a
counter in the same row (or a new row) is incremented.

There is also a new report 'SNMP Connect Failures' to show the devices with
non-zero SNMP connect failure counters. A configurable limit in the setting
'max_deferrals' is used to set the threshold of no longer polling the device.

To reset the deferrals/failures count, restart the Netdisco backend (which
regenerates 'device_skip' cache entries).
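
The skip logic described above can be pictured with a small in-memory Python sketch. This is an illustration under assumptions, not the actual schema or query: the `DeviceSkip` class, its method names, and the use of `>=` when comparing against 'max_deferrals' are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class SkipRow:
    # One row per (backend, device): denied actions plus a failure counter.
    actionset: Set[str] = field(default_factory=set)
    deferrals: int = 0


class DeviceSkip:
    """Hypothetical in-memory stand-in for the 'device_skip' table."""

    def __init__(self, max_deferrals: int) -> None:
        self.max_deferrals = max_deferrals
        self.rows: Dict[Tuple[str, str], SkipRow] = {}

    def deny(self, backend: str, device: str, actions: Iterable[str]) -> None:
        # Called at backend load or device add, for actions denied by ACL.
        row = self.rows.setdefault((backend, device), SkipRow())
        row.actionset |= set(actions)

    def record_connect_failure(self, backend: str, device: str) -> None:
        # Increment the counter in the existing row, or in a new row.
        row = self.rows.setdefault((backend, device), SkipRow())
        row.deferrals += 1

    def should_skip(self, backend: str, device: str, action: str) -> bool:
        row = self.rows.get((backend, device))
        if row is None:
            return False
        # Assumption: polling stops once the counter reaches the limit.
        return action in row.actionset or row.deferrals >= self.max_deferrals

    def failure_report(self) -> List[Tuple[str, int]]:
        # Devices with a non-zero SNMP connect failure counter.
        return sorted((device, row.deferrals)
                      for (_backend, device), row in self.rows.items()
                      if row.deferrals > 0)
```

In this sketch, restarting the backend corresponds to building a fresh `DeviceSkip` object, which discards the counters and repopulates only the ACL-derived rows.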

Squashed commit of the following:

commit b5e32c219d
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 20:55:14 2017 +0100

    show all failed connections in report

commit ffce3cee84
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 20:12:39 2017 +0100

    only resolve fqdn once

commit cc4f680f01
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 20:10:20 2017 +0100

    Revert "only resolve fqdn once"

    This reverts commit 3d136a54de.

commit d8d082b30e
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 20:09:05 2017 +0100

    a report to show SNMP failures

commit 3d136a54de
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 19:37:58 2017 +0100

    only resolve fqdn once

commit 4550b8a84c
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 17:27:43 2017 +0100

    skipover now implicit from deferrals/actionset; fix sql where logic with better correlation

commit b51edbccd2
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 16:11:29 2017 +0100

    only abort lock if action matches badactions

commit 415559b24f
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 13:56:42 2017 +0100

    set skipover true when adding to actionset

commit 1086f2c467
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 13:50:56 2017 +0100

    fix empty actionset

commit 31962580b8
Merge: 9b2e993e 6808133b
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 13:25:08 2017 +0100

    Merge branch 'og-device_skip' of github.com:netdisco/netdisco into og-device_skip

commit 6808133bdb
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 13:19:54 2017 +0100

    in-job checks for acls are required for netdisco-do foreground actions

commit 3944dd7813
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 13:18:30 2017 +0100

    avoid extra device lookup

commit 9b2e993e0f
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 12:31:36 2017 +0100

    also delete device_skip rows when deleting device

commit b55854e91d
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 11:34:27 2017 +0100

    actions in device_skip table are now an array/set

commit 5e126eef07
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 09:36:33 2017 +0100

    typo

commit 44266f2767
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 09:14:25 2017 +0100

    *able checks within jobs should not be necessary with skiplist

commit e7c22e7d11
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 08:58:57 2017 +0100

    increment deferrals field when job is deferred

commit 88ae9c00ba
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 08:40:27 2017 +0100

    turn connect fail into defer

commit eac1857043
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue May 23 08:26:59 2017 +0100

    rename failures column to be deferrals

commit 96ed444bbb
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Mon May 22 22:52:51 2017 +0100

    set up list of jobs the backend instance should skip

commit 3a0019296d
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Mon May 22 22:01:50 2017 +0100

    separate out is_*able last_* checks

commit cf8589aba2
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun May 21 22:35:38 2017 +0100

    change from ignore to skip name

commit ed193356f8
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun May 21 14:52:33 2017 +0100

    device_ignore table to track devices to skip in polling
2017-05-27 08:50:08 +01:00
Oliver Gorwits
d23b32500f relocate repo files so ND2 is the only code 2017-04-14 23:08:55 +01:00