Hi
We've encountered a situation in our production cluster where we ran out of disk space, and as a result our nodes have not been able to write out their stats (.gfs) files.
We recover the disk space, but the stats never resume writing without a restart. Nodes where this failure has occurred also report 0% heap usage at all times.
A more worrying side effect is that the nodes, if left without a restart, slowly run out of memory. Is this OOM a result of stats building up in the node but not being flushed to disk? Other nodes in the same cluster that have not had the issue all seem to behave normally; only the nodes where the stats threads have died show a gradual increase in memory usage (until eventual OOM) that does not recover with a full GC. The OOM is very slow, over a number of days, which makes me think it may be stats accumulating in memory.
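For reference, here is a minimal sketch (illustrative only, not our actual monitoring setup) of one way to see whether retained heap keeps growing even after forced full GCs on an affected node:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapWatch {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                // Force a full GC, then record heap occupancy. If the retained
                // size keeps climbing between samples, something is accumulating
                // on the heap rather than being released after collection.
                memory.gc();
                MemoryUsage heap = memory.getHeapMemoryUsage();
                System.out.printf("used=%d MB committed=%d MB max=%d MB%n",
                        heap.getUsed() / (1024 * 1024),
                        heap.getCommitted() / (1024 * 1024),
                        heap.getMax() / (1024 * 1024));
                Thread.sleep(60_000L); // sample once a minute
            }
        }
    }

On the nodes in question, the used figure climbs steadily over days regardless of full GCs, which is what leads me to suspect an in-memory build-up.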
In terms of other writes to disk from these nodes, we use Log4j for logging, and as far as I can tell from a Google search on "Log4j disk full", it just throws an exception rather than queuing up writes and consuming memory.
Thanks
David.