Bug bounty: Kill the 'Cyan flash of death'

OK - finally some good news.
We received a firmware workaround to test from TI on Monday.
Results so far are good, including a stress test that has run without error for 42 hours and over 200,000 passes - numbers we have never been able to achieve before.
Important: this test has not required any of the other error recovery mechanisms - the core has not rebooted during this time, a single connection to the cloud has been maintained, no transactions have been lost, and so on.

These are good test results, but that’s not the end of it, yet.

In summary:

  • We should soon be able to release a patch to the CC3000 firmware that vastly reduces the probability of CFOD.
  • The exact mechanism for this release is currently under development.
  • There will need to be subsequent patch(es) to the CC3000 that will fix the root cause of CFOD.

Here are the gory details for those interested:

  1. This workaround has not yet been put through TI’s official test and release procedure; we have a conference call with them tomorrow morning (USA time) to discuss their testing & release schedule. They are looking for a slot in their test plan so that they can release a formal patch as quickly as possible. Look for a further update after that meeting.
  2. We here at Spark need to package up CC3000 firmware upgrades in a manner that is straightforward, reliable, easy to comprehend, and simple for all users.
  3. I use the word workaround above intentionally. This firmware update does not fix the root cause of the CC3000 failures that lead to CFOD; however, it does appear to avoid the most common situations that lead to CC3000 failure, and thence to CFOD. Here is what we believe is going on - I’ll apologise in advance if parts of this are not explained clearly; we have been deep inside this problem for so long it feels like family.
  • The root cause of the CC3000 failures that lead to CFOD is buffer starvation and/or allocation issues that can result in a deadlock.
  • The situation is that the CC3000 has buffers queued to send, but finds it also needs buffers for either ARP packets or TCP protocol packets before it can proceed and transmit the data - and there are none available.
  • This is a complex problem, and anyone who has written a TCP/IP stack on limited-resource hardware is familiar with these kinds of issues.
  • For whatever reason, fixing this in the CC3000 is proving extraordinarily difficult.
  • In addition, the current behaviour of the CC3000 is to continually update its ARP cache based on packets it receives, regardless of whether those packets form part of ongoing traffic through the CC3000.
  • With this behaviour, if the ARP cache is already full, then a random ARP cache entry is chosen and replaced.
  • In the Core, if the ejected ARP entry is the one for the default gateway, and there are already packet(s) enqueued in the CC3000 ready to be sent to the cloud, then the CC3000 must ARP to find the MAC address of the default gateway.
  • This is apparently when the CC3000 can find itself in a deadlock, needing buffers to send the ARP request and process the reply, but not having any available.
  • So there is a series of events, each individually contributing to the probability that the CC3000 will fail.
  • This explains why:
      • busy networks made CFOD more likely (more packets to pollute the ARP cache);
      • busy applications made CFOD more likely (more chance that the CC3000 will have pending TCP traffic to the cloud at any moment);
      • time to failure was highly variable (the random chance that the ARP entry ejected would belong to the Core’s TCP connection to the cloud).
  • The fix we have been testing stops the automated update of the ARP cache based on packets received. It is now updated only on an as-needed basis.
  • So, while it does not fix the root-cause resource allocation/starvation problem, it appears to vastly reduce the probability that this bug will be triggered in practical use (a toy sketch of the mechanism, and of the workaround, follows below).
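
For anyone who wants to see the failure mode rather than just read about it, here is a small, self-contained C sketch of the mechanism described above. It is our own toy model - the cache size, buffer-pool size, names (`passive_arp_learning`, `transmit_to_cloud`, etc.) and logic are illustrative assumptions, not TI’s actual firmware - but it shows why learning ARP entries from every received packet, combined with random eviction and a shared buffer pool, can leave the stack unable to resolve the gateway:

```c
/* cfod_sim.c - toy model of the suspected CC3000 CFOD mechanism.
 * Hypothetical illustration only: the sizes, names and logic here are ours,
 * not TI's. It shows how "passive" ARP learning (update the cache from every
 * received packet, evict a random entry when full) plus a small shared buffer
 * pool can leave the stack unable to ARP for the default gateway. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

#define ARP_CACHE_SIZE 4            /* illustrative, not the real cache size */
#define BUFFER_POOL    4            /* shared pool used for TX and ARP       */
#define GATEWAY_IP     0xC0A80101u  /* 192.168.1.1 in this toy network       */

typedef struct { uint32_t ip; bool valid; } arp_entry_t;

static arp_entry_t cache[ARP_CACHE_SIZE];
static int  buffers_free = BUFFER_POOL;
static bool passive_arp_learning = true;   /* pre-patch behaviour */

static bool cache_has(uint32_t ip) {
    for (int i = 0; i < ARP_CACHE_SIZE; i++)
        if (cache[i].valid && cache[i].ip == ip) return true;
    return false;
}

static void cache_insert(uint32_t ip) {
    for (int i = 0; i < ARP_CACHE_SIZE; i++)
        if (!cache[i].valid) { cache[i] = (arp_entry_t){ ip, true }; return; }
    /* Cache full: evict a random entry - possibly the gateway's. */
    cache[rand() % ARP_CACHE_SIZE] = (arp_entry_t){ ip, true };
}

/* Called for every packet seen on the (busy) LAN, related to us or not. */
static void on_packet_received(uint32_t src_ip) {
    if (passive_arp_learning && !cache_has(src_ip))
        cache_insert(src_ip);
}

/* Try to send one queued TCP packet to the cloud via the default gateway.
 * Returns false on the deadlock: we need a buffer to ARP, but every buffer
 * already holds data that cannot be sent until the ARP completes. */
static bool transmit_to_cloud(void) {
    if (!cache_has(GATEWAY_IP)) {
        if (buffers_free == 0) return false;   /* the CFOD condition */
        buffers_free--;                        /* buffer for ARP request/reply */
        cache_insert(GATEWAY_IP);
        buffers_free++;
    }
    return true;                               /* gateway resolved, packet can go out */
}

int main(void) {
    srand(1);
    cache_insert(GATEWAY_IP);      /* gateway resolved at connection time */
    buffers_free = 0;              /* all buffers hold packets queued for the cloud */

    /* Busy network: unrelated traffic pollutes the ARP cache. */
    for (uint32_t ip = 0xC0A80110u; ip < 0xC0A80130u; ip++)
        on_packet_received(ip);

    puts(transmit_to_cloud()
         ? "transmitted OK"
         : "deadlock: gateway entry evicted, no buffer left to ARP (CFOD)");
    return 0;
}
```

Flipping `passive_arp_learning` to false models the workaround: unrelated LAN traffic no longer touches the cache, so the gateway entry survives and `transmit_to_cloud()` succeeds even with every buffer in use. Again, this is only a sketch of how we understand the problem, not TI’s code.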

Note: Errors in this post should be considered mine, not TI’s.
