Hi, I have found a major problem in my application. For an unknown reason, the locator log shows "Suspect notification for member" messages, and after a while the locator is forced out of the distributed system. This is a major problem because without the locator I lose connectivity with the server.
The architecture of my application is many clients connected to a server with two nodes, and each node runs its own locator.
In the logs, 172.29.0.179 is the IP of one node and 172.29.0.178 is the IP of the other. The log below is from the locator that starts on 172.29.0.178.
The log lines are these:
[info 2013/12/13 16:55:03.907 ART <main> tid=0x1] Starting distributed system
[info 2013/12/13 16:55:04.227 ART <main> tid=0x1] GemFire P2P Listener started on tcp:///172.29.0.178:59357
[info 2013/12/13 16:55:04.344 ART <main> tid=0x1] Attempting to join distributed system whose membership coordinator is 172.29.0.179(31524)<v75>:34667 using membership ID datac-r35-07(22965):56765
[info 2013/12/13 16:55:06.084 ART <main> tid=0x1] Entered into membership in group GF66 with ID datac-r35-07(22965:admin)<v86>:56765/59357.
[info 2013/12/13 16:55:06.085 ART <main> tid=0x1] Starting DistributionManager datac-r35-07(22965:admin)<v86>:56765/59357.
[info 2013/12/13 16:55:06.086 ART <main> tid=0x1] Initial (membershipManager) view = [172.29.0.179(31524:admin)<v75>:34667/44384, 172.29.0.179(31623:admin)<v76>:49692/36785, 172.29.0.179(31733)<v77>:38302/46653, datac-r35-07(22965:admin)<v86>:56765/59357]
[info 2013/12/13 16:55:06.086 ART <main> tid=0x1] DMMembership: Admitting new administration member < 172.29.0.179(31524:admin)<v75>:34667/44384 >.
[info 2013/12/13 16:55:06.086 ART <main> tid=0x1] DMMembership: Admitting new administration member < 172.29.0.179(31623:admin)<v76>:49692/36785 >.
[info 2013/12/13 16:55:06.086 ART <main> tid=0x1] Admitting member <172.29.0.179(31733)<v77>:38302/46653>. Now there are 3 non-admin member(s).
[info 2013/12/13 16:55:06.087 ART <main> tid=0x1] DMMembership: Admitting new administration member < datac-r35-07(22965:admin)<v86>:56765/59357 >.
[info 2013/12/13 16:55:06.154 ART <main> tid=0x1] DistributionManager datac-r35-07(22965:admin)<v86>:56765/59357 started on datac-r35-07.gire.com[55421],datac-r35-08.gire.com[55421]. There were 1 other DMs. others: [172.29.0.179(31733)<v77>:38302/46653] (admin only)
[info 2013/12/13 16:55:06.163 ART <main> tid=0x1] Locator started on 172.29.0.178[55421]
[info 2013/12/13 16:55:06.163 ART <main> tid=0x1] Starting server location for Distribution Locator on datac-r35-07[55421]
[info 2013/12/13 16:55:12.399 ART <UDP ucast receiver> tid=0x1d] Membership: received new view [172.29.0.179(31524)<v75>:34667|87] [172.29.0.179(31524)<v75>:34667/44384, 172.29.0.179(31623)<v76>:49692/36785, 172.29.0.179(31733)<v77>:38302/46653, datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491]
[info 2013/12/13 16:55:12.419 ART <View Message Processor> tid=0x31] DMMembership: Admitting new administration member < datac-r35-07(23067:admin)<v87>:41624/44491 >.
[info 2013/12/13 16:55:19.526 ART <UDP ucast receiver> tid=0x1d] Membership: received new view [172.29.0.179(31524)<v75>:34667|88] [172.29.0.179(31524)<v75>:34667/44384, 172.29.0.179(31623)<v76>:49692/36785, 172.29.0.179(31733)<v77>:38302/46653, datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097]
[info 2013/12/13 16:55:19.539 ART <View Message Processor> tid=0x31] Admitting member <datac-r35-07(23184)<v88>:13261/47097>. Now there are 6 non-admin member(s).
[info 2013/12/13 16:59:21.292 ART <P2P message reader for 172.29.0.179(31733)<v77>:38302/46653 SHARED=true ORDERED=true UID=18> tid=0x3c] Member at 172.29.0.179(31733)<v77>:38302/46653 gracefully left the distributed cache: shutdown message received
[info 2013/12/13 16:59:21.542 ART <UDP ucast receiver> tid=0x1d] Received Suspect notification for member(s) [172.29.0.179(31733)<v77>:38302] from 172.29.0.179(31623)<v76>:49692.
[info 2013/12/13 16:59:21.568 ART <UDP ucast receiver> tid=0x1d] Membership: received new view [172.29.0.179(31524)<v75>:34667|89] [172.29.0.179(31524)<v75>:34667/44384, 172.29.0.179(31623)<v76>:49692/36785, datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097]
[info 2013/12/13 16:59:25.309 ART <UDP ucast receiver> tid=0x1d] Received Suspect notification for member(s) [172.29.0.179(31623)<v76>:49692] from 172.29.0.179(31524)<v75>:34667.
[info 2013/12/13 16:59:25.325 ART <UDP ucast receiver> tid=0x1d] Membership: received new view [172.29.0.179(31524)<v75>:34667|90] [172.29.0.179(31524)<v75>:34667/44384, datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097]
[info 2013/12/13 16:59:26.542 ART <VERIFY_SUSPECT.TimerThread> tid=0x3e] No suspect verification response received from 172.29.0.179(31733)<v77>:38302 in 5000 milliseconds: I believe it is dead.
[info 2013/12/13 16:59:28.590 ART <UDP ucast receiver> tid=0x1d] Membership: received new view [172.29.0.179(31524)<v75>:34667|91] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097]
[info 2013/12/13 16:59:31.543 ART <VERIFY_SUSPECT.TimerThread> tid=0x3e] No suspect verification response received from 172.29.0.179(31623)<v76>:49692 in 6234 milliseconds: I believe it is dead.
[info 2013/12/13 16:59:37.778 ART <VERIFY_SUSPECT.TimerThread> tid=0x3e] No suspect verification response received from 172.29.0.179(31524)<v75>:34667 in 9156 milliseconds: I believe it is dead.
[info 2013/12/13 16:59:49.949 ART <Timer-4> tid=0x1a] Could not connect to distribution locator datac-r35-08<v0>:55421: java.net.ConnectException: Connection refused
[info 2013/12/13 17:00:47.071 ART <Timer-4> tid=0x1a] Could not connect to distribution locator datac-r35-08<v0>:55421: java.net.ConnectException: Connection refused
[info 2013/12/13 17:01:44.195 ART <Timer-4> tid=0x1a] Could not connect to distribution locator datac-r35-08<v0>:55421: java.net.ConnectException: Connection refused
[info 2013/12/13 17:02:41.317 ART <Timer-4> tid=0x1a] Could not connect to distribution locator datac-r35-08<v0>:55421: java.net.ConnectException: Connection refused
[info 2013/12/13 17:03:38.440 ART <Timer-4> tid=0x1a] Could not connect to distribution locator datac-r35-08<v0>:55421: java.net.ConnectException: Connection refused
[info 2013/12/13 17:04:02.356 ART <ViewHandler> tid=0x4f] Membership: sending new view [[datac-r35-07(22965)<v86>:56765|92] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27638)<v92>:39650/55857]] (4 mbrs)
[info 2013/12/13 17:04:02.370 ART <UDP Incoming Message Handler> tid=0x1c] Membership: received new view [datac-r35-07(22965)<v86>:56765|92] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27638)<v92>:39650/55857]
[info 2013/12/13 17:04:02.391 ART <View Message Processor> tid=0x31] DMMembership: Admitting new administration member < 172.29.0.179(27638:admin)<v92>:39650/55857 >.
[info 2013/12/13 17:04:08.348 ART <ViewHandler> tid=0x4f] Membership: sending new view [[datac-r35-07(22965)<v86>:56765|93] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27638)<v92>:39650/55857, 172.29.0.179(27739)<v93>:1619/43425]] (5 mbrs)
[info 2013/12/13 17:04:08.362 ART <UDP Incoming Message Handler> tid=0x1c] Membership: received new view [datac-r35-07(22965)<v86>:56765|93] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27638)<v92>:39650/55857, 172.29.0.179(27739)<v93>:1619/43425]
[info 2013/12/13 17:04:08.381 ART <View Message Processor> tid=0x31] DMMembership: Admitting new administration member < 172.29.0.179(27739:admin)<v93>:1619/43425 >.
[info 2013/12/13 17:04:15.542 ART <ViewHandler> tid=0x4f] Membership: sending new view [[datac-r35-07(22965)<v86>:56765|94] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27638)<v92>:39650/55857, 172.29.0.179(27739)<v93>:1619/43425, 172.29.0.179(27857)<v94>:36916/47253]] (6 mbrs)
[info 2013/12/13 17:04:15.558 ART <UDP Incoming Message Handler> tid=0x1c] Membership: received new view [datac-r35-07(22965)<v86>:56765|94] [datac-r35-07(22965)<v86>:56765/59357, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27638)<v92>:39650/55857, 172.29.0.179(27739)<v93>:1619/43425, 172.29.0.179(27857)<v94>:36916/47253]
[info 2013/12/13 17:04:15.577 ART <View Message Processor> tid=0x31] Admitting member <172.29.0.179(27857)<v94>:36916/47253>. Now there are 6 non-admin member(s).
[info 2013/12/14 02:23:53.283 ART <Timer-4> tid=0x1a] Could not connect to distribution locator datac-r35-07<v0>:55421: java.net.SocketException: Socket closed
[info 2013/12/14 09:08:11.416 ART <UDP ucast receiver> tid=0x1d] Received Suspect notification for member(s) [datac-r35-07(22965)<v86>:56765] from 172.29.0.179(27857)<v94>:36916.
[info 2013/12/14 09:08:11.598 ART <UDP ucast receiver> tid=0x1d] Membership: received new view [172.29.0.179(27638)<v92>:39650|105] [172.29.0.179(27638)<v92>:39650/55857, datac-r35-07(23067)<v87>:41624/44491, datac-r35-07(23184)<v88>:13261/47097, 172.29.0.179(27739)<v93>:1619/43425, 172.29.0.179(27857)<v94>:36916/47253] crashed mbrs: [datac-r35-07(22965)<v86>:56765/59357]
[severe 2013/12/14 09:08:11.608 ART <CloserThread> tid=0x12c] Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system by 172.29.0.179(27638)<v92>:39650. Please consult GemFire logs to find the reason. (GMS shun)
[info 2013/12/14 09:08:11.609 ART <CloserThread> tid=0x12c] Stopping Distribution Locator on datac-r35-07[55421]
[info 2013/12/14 09:08:11.622 ART <CloserThread> tid=0x12c] Disconnecting distributed system for Distribution Locator on datac-r35-07[55421]
[info 2013/12/14 09:08:11.623 ART <CloserThread> tid=0x12c] Shutting down DistributionManager datac-r35-07(22965:admin)<v86>:56765/59357.
[info 2013/12/14 09:08:11.623 ART <main> tid=0x1] Locator stopped
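In case it helps anyone reading the log, the membership-related lines can be pulled out with a small filter. This is only a rough sketch: the function name `membership_events`, the keyword list, and the file name `locator.log` are my own assumptions for illustration and are not part of GemFire. The regular expression is simply modelled on the line format shown above.

```python
import re

# Header format as it appears in the log above (an assumption drawn from these lines):
# [level yyyy/mm/dd hh:mm:ss.SSS TZ <thread name> tid=0x..] message
HEADER = re.compile(
    r"^\[(?P<level>\w+) "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<tz>\w+) <(?P<thread>.*?)> tid=(?P<tid>0x[0-9a-f]+)\] "
    r"(?P<msg>.*)$"
)

# Hypothetical keyword list: phrases copied from the messages I care about.
KEYWORDS = (
    "Suspect notification",
    "No suspect verification response",
    "forced out of the distributed system",
    "crashed mbrs",
)


def membership_events(lines):
    """Return (timestamp, level, message) for lines mentioning a keyword."""
    events = []
    for line in lines:
        match = HEADER.match(line.strip())
        if match is None:
            continue  # not a log line in the expected format
        msg = match.group("msg")
        if any(keyword in msg for keyword in KEYWORDS):
            events.append((match.group("ts"), match.group("level"), msg))
    return events


# Example usage (the path "locator.log" is hypothetical):
# with open("locator.log") as handle:
#     for ts, level, msg in membership_events(handle):
#         print(ts, level, msg)
```

Run over the log above, this should keep only the suspect, verification-timeout, and forced-disconnect lines, which makes the sequence of events easier to follow.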
What does "Received Suspect notification" mean?
Why did it happen?
Thanks,
Juan