Bug bounty: Kill the 'Cyan flash of death'

Thanks @AndyW, it’s good to know that ARP is at least part of the problem. In my professional life, I designed an Ethernet MAC core for FPGAs that does ICMP Echo and ARP in hardware, so I am fairly aware of what kind of things can go wrong.

It sounds like the working theory for your linked TI support case is that buffer mismanagement causes ARP cache corruption, and the CC3000 then re-ARPs for the gateway since it doesn’t know who to talk to.

I think the theory I was putting forth is similar: buffer mismanagement causes ARP cache corruption, and the CC3000 cannot figure out how to ARP for any host. This fits the CFOD case where the CC3000 goes radio silent except for ICMP echo, which @wtfuzz said he saw continue for some time.

The idea was that flushing the ARP cache periodically, or on failure detection, could be a work-around to stay out of (or get out of) the mute condition. I’m afraid I never see CFOD myself, so I really can’t contribute very much.
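For what it’s worth, a minimal sketch of that work-around on the Core side might look like the following, assuming the TI host driver’s netapp_arp_flush() is reachable from the application build; connection_looks_dead() is a hypothetical placeholder for whatever failure detection you already have:

```cpp
// Rough sketch only: flush the CC3000 ARP cache periodically or when a
// failure is detected, inside a Spark application's loop().
// Assumes the TI host driver's netapp_arp_flush() is exposed by the build;
// connection_looks_dead() is a hypothetical stand-in for real failure detection.
#include "netapp.h"        // CC3000 host driver header (core-common-lib)

extern bool connection_looks_dead(void);   // hypothetical placeholder

void loop()
{
    static unsigned long lastFlush = 0;

    // Flush every 60 s, or immediately when the link looks dead.
    if (connection_looks_dead() || (millis() - lastFlush > 60000UL)) {
        netapp_arp_flush();
        lastFlush = millis();
    }

    // ... rest of the application ...
}
```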

Keep fighting the good fight!

I’m at an all-time record of 12 hours of uptime with aucARP=0 and still going…

That’s great, @wtfuzz! I’m sorry; I see now that I should have read all three pages of the TI support conversation first, where they explicitly asked @AndyW to never call ARP flush and to force the ARP cache to never time out. Even then, the :spark: team did not appear to get perfect results. I understand they are having a better private conversation now.

In theory, the ARP cache should be able to be flushed at any time, and entries should be able to age out at any time; it is just a cache. As the TI guys said, this can lead to reordering in the TX stream, since some TX packets might have to wait for their destination’s ARP entry to reload before transmitting, but the order of packets out of any implementation can differ for lots of reasons. In normal use of the :spark: core, though, I would think the core is only talking to the gateway and not to peers on the local subnet, so there is really only one entry that counts.
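For reference, the aucARP=0 that @wtfuzz mentioned above is the ARP aging timeout handed to the host driver’s netapp_timeout_values() call. A minimal sketch of setting it, with the other three values as illustrative assumptions rather than Spark’s actual defaults:

```cpp
// Sketch: tell the CC3000 never to age out ARP entries (aucARP = 0).
// netapp_timeout_values() is the stock TI host driver call; the other
// three values below are illustrative assumptions, not Spark defaults.
#include "netapp.h"        // CC3000 host driver header (core-common-lib)

void disableArpAging()
{
    unsigned long aucDHCP       = 14400;  // DHCP lease renew, seconds
    unsigned long aucARP        = 0;      // 0 = ARP entries never time out
    unsigned long aucKeepalive  = 10;     // TCP keepalive, seconds
    unsigned long aucInactivity = 0;      // 0 = no inactivity timeout

    netapp_timeout_values(&aucDHCP, &aucARP, &aucKeepalive, &aucInactivity);
}
```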

What is the most stable firmware at this point? How do I flash it on Windows 7?

We just got off a conf. call with TI. They are zeroing in on the problem - the debug dumps are showing that there is a flurry of ARP activity around the time of the failure. There should be no ARP activity at all in our test cases, so something is obviously going wrong that can trigger a more catastrophic failure.
The root cause is currently not known, but TI and Spark are making progress towards resolution.

In addition, I believe that a recent minor change to the cloud software (it now sends one packet, where before it was sending two in quick succession) can help improve the life expectancy before CFOD. This is obviously not a fix, but if people are seeing the MTTF improve, that is one possible cause.

More updates as we get them.


Great news @AndyW!

If anyone who is experiencing lock-ups with CFOD has an easy way to monitor uptime of some polled sensor data, I would be interested to see if my proposal for watchdog timer fixes helps you:

Note this code is basically the latest Core-Firmware Master with my changes, so any other CFOD fixes are not present. I think it's a fair test, and any additional CFOD fixes should make things more stable. The watchdog stuff just keeps the random weird bugs from completely crippling your application.
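For anyone wondering what the watchdog approach boils down to on the STM32, here is the generic pattern using the standard peripheral library’s IWDG calls. This is only a sketch of the technique, not the actual patch in the linked branch, and the prescaler/reload values are assumptions:

```cpp
// Generic STM32F10x independent watchdog (IWDG) pattern -- a sketch of
// the technique only, not the actual changes in the linked branch.
// With the ~40 kHz LSI clock, prescaler 256 and max reload give roughly
// a 26 second timeout (values here are assumptions; tune to taste).
#include "stm32f10x_iwdg.h"

void watchdogInit()
{
    IWDG_WriteAccessCmd(IWDG_WriteAccess_Enable);
    IWDG_SetPrescaler(IWDG_Prescaler_256);   // LSI / 256 ≈ 156 Hz
    IWDG_SetReload(0x0FFF);                  // max 12-bit reload value
    IWDG_ReloadCounter();
    IWDG_Enable();                           // cannot be stopped once started
}

void watchdogFeed()
{
    // Call this from the main loop while the application is healthy; if
    // the loop (or the CC3000 glue code) hangs, the counter expires and
    // the MCU resets instead of sitting in CFOD forever.
    IWDG_ReloadCounter();
}
```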

@david_s5
I ran your binary on our office network, and soon enough the Core entered a stage of perpetual green flashing and never got back online. Are these binaries generated from the latest master branch on your repo?

I have pasted the log here: http://pastebin.com/zFi2vCMz

Please let me know if that helps.

As I told you earlier, @BDub, I have 7 Cores running your modified watchdog and they are all above 24 hours of uptime and ticking. Looking very promising for my intended use: a sensor that collects data, plus a Node server that polls it at a set frequency and saves it to Parse.


@mohit

The failure mode looks different; it looks like the spark protocol was connected. The log does not look like it matches the ‘0_inact_20_sec_recovery_log_core-firmware.zip’ bin.

Please confirm it was http://nscdg.com/spark/0_inact_20_sec_recovery_log_core-firmware.zip

I am waiting on @zachary to do a merge with the timer tick changes. Then I will merge and push the core-firmware code.

@david_s5
To be sure, I re-flashed the Core and saved the logs again. The logs look different, but the behavior is the same: there was one CFOD recovery followed by perpetual green flashing.

http://pastebin.com/ABETchNE

Thanks!

@david_s5 @mohit @BDub Not sure if this is helpful, but I get the Spark Core lockup even when it shows the breathing cyan.

My WiFi lost its connection to the net, which caused my main loop to freeze, and the core must have frozen as well because it kept showing the normal breathing cyan.

I’m about to install @BDub’s watchdog timer as soon as I get a second to figure out how to do it, but I have a feeling it’s going to eliminate these issues by auto-resetting the core.

Video showing this

Here are my thoughts on this problem:

pdf

@RWB @mohit @BDub - I have a major set of changes that stabilize the core. I will be merging and setting up a PR shortly. With it we should always know where the code was running if it hangs. All faults yield a panic code: a red LED SOS followed by the error code. It provides for orderly timeouts on network calls with full logging. Stay tuned.

This link works in my browser


@mohit The logs still look ODD. How are you capturing the data? See the *** below. Is this a clue? Voltage OK?

0000063020:<DEBUG> virtual size_t UDP::write(const uint8_t*, size_t) (133):**(Clo)**
0000063028:<DEBUG> virtual size_t UDP::write(const uint8_t*, size_t) (135):**send9**
0000063038:<DEBUG> virtual size_t UDP::write(const uint8_t*, size_t) (136):**(Clo)**

Thanks for sharing! Hope it helps somehow.

This looks very detailed. Thank you!

I will spend some time to see whether the new SPI driver (https://github.com/spark/core-common-lib) has mitigated all the issues you raised.

Thank you again! @pra

@mohit - All my changes are checked in and merged with the latest spark masters.

Please build and test with https://github.com/spark/core-common-lib et al.

Aloha David

I have an adafruit breakout … I’ll fire it up when I get a chance.

Tonight I wrote a Python script that can be used to force the CC3000 into CFOD and/or unresponsiveness using ARP flooding. This should help with testing, as it speeds up the process of making the CC3000 puke instead of having to wait for it with normal network use. Maybe the TI guys can also use it when fixing the firmware? Anyway, I’ve run it quite a few times; it works 100% of the time and knocks the CC3000 off the network almost immediately (but you can still ping the core). It doesn’t make the LED flash immediately, but the core cannot be accessed via the API; you’ll get a ‘timed out’ error.

If anyone has other CC3000 modules besides the Spark Core, such as the Adafruit breakout board, I’d like to hear whether this script affects them.

gist for cc3000flood.py
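The gist above is the actual tool; purely to illustrate what the flood does at the packet level, here is a rough C-style equivalent (Linux AF_PACKET raw socket, needs root; the interface name and packet count are placeholders, and the real cc3000flood.py may differ in detail):

```cpp
// Rough illustration of the ARP-flood idea, not the linked Python gist:
// broadcast a burst of ARP requests with random sender/target IPs so the
// CC3000's ARP handling gets hammered.  Linux only (AF_PACKET), run as
// root; interface name and packet count are placeholder assumptions.
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";      // placeholder
    int count          = (argc > 2) ? atoi(argv[2]) : 10000; // placeholder

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
    if (fd < 0) { perror("socket"); return 1; }

    // Look up the interface index and MAC address.
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) { perror("SIOCGIFINDEX"); return 1; }
    int ifindex = ifr.ifr_ifindex;
    if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) { perror("SIOCGIFHWADDR"); return 1; }
    unsigned char src_mac[6];
    memcpy(src_mac, ifr.ifr_hwaddr.sa_data, 6);

    struct sockaddr_ll dst;
    memset(&dst, 0, sizeof(dst));
    dst.sll_family  = AF_PACKET;
    dst.sll_ifindex = ifindex;
    dst.sll_halen   = 6;
    memset(dst.sll_addr, 0xff, 6);                        // broadcast

    // Hand-built Ethernet header (14 bytes) + ARP request (28 bytes).
    unsigned char frame[42];
    memset(frame, 0xff, 6);                               // dst MAC: broadcast
    memcpy(frame + 6, src_mac, 6);                        // src MAC
    frame[12] = 0x08; frame[13] = 0x06;                   // EtherType: ARP
    unsigned char *arp = frame + 14;
    arp[0] = 0x00; arp[1] = 0x01;                         // HW type: Ethernet
    arp[2] = 0x08; arp[3] = 0x00;                         // proto: IPv4
    arp[4] = 6;    arp[5] = 4;                            // HW / proto lengths
    arp[6] = 0x00; arp[7] = 0x01;                         // opcode: request
    memcpy(arp + 8, src_mac, 6);                          // sender MAC
    memset(arp + 18, 0x00, 6);                            // target MAC: unknown

    for (int i = 0; i < count; i++) {
        for (int b = 0; b < 4; b++) {
            arp[14 + b] = rand() & 0xff;                  // random sender IP
            arp[24 + b] = rand() & 0xff;                  // random target IP
        }
        if (sendto(fd, frame, sizeof(frame), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");
    }
    close(fd);
    return 0;
}
```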
