Bug bounty: Kill the 'Cyan flash of death'

I was so excited to get these fixes pushed to the compile-server2 branch (i.e. what the web-based IDE uses) and try them out. You can only imagine how disappointed I was when I found my core CFOD-ing with great regularity. :frowning:

But wait! I read the post listed below and discovered that if you flash the same sketch WITHOUT MODIFYING ANYTHING, it just pushes the "old" compiled binary to the Spark; it doesn't re-compile with the new changes.

So I changed my sketch, hit the "Verify" button, and pushed it to the Spark. My sketch has been running for almost 4 hours now and I've had two "recoveries" which normally would have CFOD-ed the Spark, but it just re-connects and continues on. This is just awesome!

Thank you so much, @david_s5, for your solution to this. Also many thanks to so many others for helping out. Many hands make light work!

Dave O

3 Likes

Exact same thing happened to me :smile:

It’s working out great for me too. I’m putting it through some pretty abnormal situations and it’s doing just fine.

1 Like

Quick update re: TI; they sent us an engineering drop for a potential firmware fix but it didn’t have any effect. @mohit has provided them feedback, so we’re still iterating on the underlying root cause with Texas Instruments. Will keep you posted as we hear more!

5 Likes

Update: We’ll be receiving the second round of CC3K patches from TI this Sunday. These patches will focus on resolving the invalidation of the ARP table inside the module. Shall post the results upon testing!

9 Likes

This bug doesn’t stand a chance! Its days are numbered…
:slight_smile:

3 Likes

Any news regarding the cc3k patches?

2 Likes

Thanks for checking in @altery
The patch that was supposed to arrive last week came out this morning. We are testing it right now. Will report shortly.

7 Likes

:+1: any luck with latest firmware from TI? @mohit

1 Like

OK - finally some good news.
We received a firmware workaround to test from TI on Monday.
Results so far are good, including a stress test that has run without error for 42 hours and over 200,000 passes - numbers we have never been able to achieve before.
Important: this test has not required any of the other error recovery mechanisms; the core has not rebooted during this time, a single connection to the cloud has been maintained, and no transactions have been lost.

These are good test results, but that’s not the end of it, yet.

In summary:

  • We should soon be able to release a patch to the CC3000 firmware that vastly reduces the probability of CFOD.
  • The exact mechanism for this release is currently under development.
  • There will need to be subsequent patch(es) to the CC3000 that will fix the root cause of CFOD.

Here are the gory details for those interested:

  1. This workaround has not yet been put through TI’s official test and release procedure; we have a conference call with them tomorrow morning (USA time) to discuss their testing & release schedule. They are looking for a slot in their test plan so that they can release a formal patch as quickly as possible. Look for a further update after that meeting.
  2. We here at Spark need to package up CC3000 firmware upgrades in a manner that is: straightforward; reliable; easy to comprehend; and simple for all users.
  3. I use the word workaround above intentionally. This firmware update does not fix the root cause of the CC3000 failures that lead to CFOD; however it does appear to avoid the most common situations that lead to the CC3000 failure, and thence CFOD. Here is what we believe is going on - I’ll apologise in advance if parts of this are not explained clearly - we have been deep inside this problem for so long it feels like family.
  • The root cause of the CC3000 failures that lead to CFOD is buffer starvation and/or allocation issues that can result in a deadlock.
  • The situation is that the CC3000 has buffers to send, but finds it also needs buffers for either ARP packets or TCP protocol packets before it can proceed and transmit the data, but there are none available.
  • This is a complex problem, and anyone who has written a TCP/IP stack on limited-resource hardware is familiar with this kind of issue.
  • For whatever reason, fixing this in the CC3000 is proving extraordinarily difficult.
  • In addition, the current behaviour of the CC3000 is to continually update its ARP cache based on packets it receives, regardless of whether those packets form part of ongoing traffic through the CC3000.
  • With this behaviour, if the ARP cache is already full, then a random ARP cache entry is chosen and replaced.
  • In the Core, if that ARP entry ejected is the one for the default gateway, and there are already packet(s) enqueued in the CC3000 ready to be sent to the cloud, then the CC3000 must ARP to find the MAC address of the default gateway.
  • This is apparently when the CC3000 can find itself in a deadlock, needing buffers to send the ARP request and process the reply, but not having any available.
  • So there is a series of events, each individually contributing to the probability that the CC3000 will fail.
  • This explains why busy networks made CFOD more likely (more packets to pollute the ARP cache), why busy applications made CFOD more likely (more chance that the CC3000 will have pending TCP traffic to the cloud at any moment), and why time to failure was highly variable (the random chance that the ARP entry ejected would belong to the Core’s TCP connection to the cloud).
  • The fix we have been testing stops the automated update of the ARP cache based on packets received. It is now only updated on a need-to basis.
  • So, while it does not fix the root cause resource allocation/starvation problem, it appears to vastly reduce the probability that this bug will be triggered in practical use.
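For anyone who wants to play with the chain of events described above, here is a toy simulation. It is only a sketch under loud assumptions: the cache size, host count, `busy` probability, and the function name `steps_until_deadlock` are all made up for illustration, and nothing here reflects the real CC3000 internals or TI's code. It models a small ARP cache that learns from overheard packets with random replacement, and declares a "deadlock" when the gateway entry is ejected while data for the cloud is already queued.

```python
import random

GATEWAY = "gateway"


def steps_until_deadlock(passive_learning, busy=0.5, cache_size=4,
                         lan_hosts=50, max_steps=20000, seed=0):
    """Return the step at which the modelled deadlock occurs, or None.

    All parameters are illustrative guesses, not CC3000 values.
    """
    rng = random.Random(seed)
    # The gateway was resolved when the connection to the cloud was opened.
    cache = [GATEWAY]
    for step in range(max_steps):
        # Another host on the LAN sends a packet that the module overhears.
        sender = "host-%d" % rng.randrange(lan_hosts)
        # Is application data for the cloud queued at this instant?
        tx_pending = rng.random() < busy
        if not passive_learning:
            continue  # workaround: overheard traffic never touches the cache
        if sender in cache:
            continue  # already known, nothing to replace
        if len(cache) < cache_size:
            cache.append(sender)
            continue
        # Cache is full: a random entry is replaced by the overheard sender.
        victim = rng.randrange(cache_size)
        evicted = cache[victim]
        cache[victim] = sender
        if evicted == GATEWAY:
            if tx_pending:
                # Queued data holds the buffers that the ARP exchange needs.
                return step
            # Nothing was queued, so the gateway is re-resolved without trouble.
            cache[rng.randrange(cache_size)] = GATEWAY
    return None


if __name__ == "__main__":
    print("original behaviour:", steps_until_deadlock(passive_learning=True))
    print("with workaround:   ", steps_until_deadlock(passive_learning=False))
```

Raising `busy` or `lan_hosts` in this model shortens the typical time to failure, and different seeds give very different failure times, which matches the pattern described in the list above. With `passive_learning=False` the gateway entry is never ejected, so the modelled deadlock never triggers, even though the underlying buffer problem is still there.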

Note: Errors in this post should be considered mine, not TI’s.

17 Likes

Thanks for the detailed writeup on the status for CFOD!

I believe this would already bring the probability of encountering CFOD on a regular basis down to a much smaller percentage and get more people working on the :spark: core.

:smile:

Wow this is super great news! Thanks for your work on this @AndyW and the detailed write up :smile:

I’m pretty familiar with running the CC3000 patching firmware, and it’s not too hard for most of the local programming guys… so hopefully it can be released as a BIN for anyone that wants to attempt it the manual way, while some fancy but reliable update mechanism is being worked on for most users that probably just use the Sparkulator.

2 Likes

I’m sure they can just write a sketch that updates the CC3000 over SPI, then take the compiled HEX file and use the Cloud Update mechanism to place an “Update CC3000 Firmware” button on the Sparkulator. Perhaps under each Core in the device list?

It’s not like it’s hard to update the firmware. I’ve done it manually via UART, through a program compiled in CCS and through a sketch with Energia.

The above method is similar to how TI officially does the firmware updates for the EVM with their Windows software. They’ve released a compiled MSP430 program that you upload to your board, and it then updates the CC3000 over SPI. (The MSP430 binary contains the firmware.)

1 Like

@AndyW ditto on what @BDub said, you are an awesome engineer and an incredible asset to Spark! Thank you for the great news!

1 Like

@AndyW Thanks for the update and all your hard work on a solution to the CFOD! Thanks to everyone else who contributed as well!

1 Like

Yeah, thanks for staying on top of this @AndyW, not just on behalf of the :spark: Community, but also 43Oh.com and all the people out there with buggy CC3000 modules! :heartpulse:

3 Likes

Ditto on the thanks @AndyW !

I am really looking forward to seeing this one in the rear view mirror.

Great news guys! @AndyW I owe you a :beer: …thanks for the detailed update…great stuff!! I thought it might have been a buffer / resource issue. Are there hex & EEPROM files for the CC3K that need to get deployed? If so, I have not re-flashed the CC3K before; is there a guide to doing so somewhere?

The promised update after our call today with TI:

  • We will continue to collaborate with TI and test an official patch candidate (what we have at present came direct from TI engineering).
  • If testing does not raise any issues, the plan is for a formal, validated & supported patch to the CC3000 firmware to be available March 28/29.
  • We will make this available on the master branch of the CC3000-patch-programmer for users who are comfortable applying patches manually. We will update this topic when that update takes place.
  • The next sprint (beginning 31-March) will implement an Over The Air update process for users who are not comfortable applying patches manually, or cannot for other reasons.
  • The plan is for the next production run of cores to ship with this patch already applied.

TI remain committed to addressing the underlying root causes of the CC3000 failures, even with the vastly reduced probability of failure.

11 Likes

Great news and great work AndyW! Impressive that a small team can get so much commitment and action from the TI behemoth. :ok_hand:

3 Likes

They may be a small team, but the spark core is a game changer - and they are producing tens of thousands of units IIRC, that’s gotta have some clout! :oncoming_automobile:

4 Likes