kafka/checkstyle
Colin Patrick McCabe 555744da70
KAFKA-14124: improve quorum controller fault handling (#12447)
Before trying to commit a batch of records to the __cluster_metadata log, the active controller
should try to apply them to its current in-memory state. If this application process fails, the
active controller process should exit, allowing another node to take leadership. This will prevent
most bad metadata records from ending up in the log and help to surface errors during testing.

Similarly, if the active controller attempts to renounce leadership, and the renunciation process
itself fails, the process should exit. This will help avoid bugs where the active controller
continues in an undefined state.

In contrast, standby controllers that experience metadata application errors should continue on, in
order to avoid a scenario where a bad record brings down the whole controller cluster. The
intended effect of these changes is to make it harder to commit a bad record to the metadata log,
but to continue to ride out the bad record as well as possible if such a record does get committed.
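
The active/standby policy described above can be sketched as follows. This is a simplified, hypothetical model rather than Kafka's QuorumController: the class name ControllerSketch, the string "records", and the two callback fields are all assumptions made for illustration. In production the fatal callback would exit the process, so any partially applied in-memory state would be discarded along with it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of the policy in the commit message; not Kafka code.
final class ControllerSketch {
    final List<String> state = new ArrayList<>(); // stand-in for in-memory metadata state
    final List<String> log = new ArrayList<>();   // stand-in for the __cluster_metadata log
    private final Consumer<String> fatalFault;    // assumed to exit the process in production
    private final Consumer<String> nonFatalFault; // assumed to only report the error

    ControllerSketch(Consumer<String> fatalFault, Consumer<String> nonFatalFault) {
        this.fatalFault = fatalFault;
        this.nonFatalFault = nonFatalFault;
    }

    // Illustrative rule: any record starting with "bad" cannot be applied.
    private void apply(String record) {
        if (record.startsWith("bad")) {
            throw new IllegalArgumentException("cannot apply " + record);
        }
        state.add(record);
    }

    // Active controller: apply the batch in memory first, and append it to
    // the log only if every record applied cleanly.
    boolean commitAsActive(List<String> batch) {
        try {
            batch.forEach(this::apply);
        } catch (RuntimeException e) {
            fatalFault.accept(e.getMessage()); // nothing is written to the log
            return false;
        }
        log.addAll(batch);
        return true;
    }

    // Standby controller: the records are already committed, so report the
    // failure and keep going with the remaining records.
    void replayAsStandby(List<String> committed) {
        for (String record : committed) {
            try {
                apply(record);
            } catch (RuntimeException e) {
                nonFatalFault.accept(e.getMessage());
            }
        }
    }
}
```

The asymmetry is the point of the design: the active path refuses to write a batch it cannot apply, while the standby path skips the failing record and stays up.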

This PR introduces the FaultHandler interface to implement these concepts. In JUnit tests, we use a
FaultHandler implementation that does not exit the process. This avoids terminating the Gradle test
runner, which would be very disruptive. It also ensures that tests surface these exceptions, which
they previously did not (the mock fault handler stores the exception).
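
A minimal sketch of how such an interface and its two kinds of implementation might look is shown below. The method signature and the class names ExitingFaultHandlerSketch and RecordingFaultHandlerSketch are assumptions for illustration and are not copied from the PR; only the name FaultHandler and the described behavior (exit in production, store the exception in tests) come from the commit message.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical shape of the interface; the real signature may differ.
interface FaultHandler {
    // Handle a fault and return an exception that the caller may throw.
    RuntimeException handleFault(String message, Throwable cause);
}

// Production-style handler (assumed): report the fault, then halt the JVM.
final class ExitingFaultHandlerSketch implements FaultHandler {
    @Override
    public RuntimeException handleFault(String message, Throwable cause) {
        System.err.println("Fatal fault: " + message);
        Runtime.getRuntime().halt(1);
        return null; // not reached once the JVM halts
    }
}

// Test-style handler (assumed): keep the first fault instead of exiting,
// so that a test can inspect it and fail with a useful message.
final class RecordingFaultHandlerSketch implements FaultHandler {
    private final AtomicReference<RuntimeException> firstFault = new AtomicReference<>();

    @Override
    public RuntimeException handleFault(String message, Throwable cause) {
        RuntimeException e = new RuntimeException(message, cause);
        firstFault.compareAndSet(null, e); // later faults do not overwrite the first
        return e;
    }

    RuntimeException firstFault() {
        return firstFault.get();
    }
}
```

A test could pass the recording handler to the component under test and, at the end, assert that firstFault() is null or rethrow it, so that the fault fails the test without killing the test runner.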

In addition to the above, this PR fixes a bug where RaftClient#resign was not being called from the
renounce() function. This bug could have left the Raft layer unaware that the active controller had
resigned.

Reviewers: David Arthur <mumrah@gmail.com>
2022-08-04 22:49:45 -07:00
.scalafmt.conf Add license header in suppressions.xml (#11753) 2022-02-17 14:35:36 +08:00
checkstyle.xml Add license header in suppressions.xml (#11753) 2022-02-17 14:35:36 +08:00
import-control-core.xml KAFKA-14124: improve quorum controller fault handling (#12447) 2022-08-04 22:49:45 -07:00
import-control-jmh-benchmarks.xml Add license header in suppressions.xml (#11753) 2022-02-17 14:35:36 +08:00
import-control.xml KAFKA-14124: improve quorum controller fault handling (#12447) 2022-08-04 22:49:45 -07:00
java.header
suppressions.xml KAFKA-14124: improve quorum controller fault handling (#12447) 2022-08-04 22:49:45 -07:00