Bug bounty: Kill the 'Cyan flash of death'

Thanks @AndyW, it’s good to know that ARP is at least part of the problem. In my professional life, I designed an Ethernet MAC core for FPGAs that does ICMP Echo and ARP in hardware, so I am fairly aware of what kind of things can go wrong.

It sounds like the working theory for your linked TI support case is that buffer mismanagement causes ARP cache corruption, and the CC3000 then re-ARPs for the gateway since it doesn’t know who to talk to.

I think the theory I was putting forth is similar: buffer mismanagement causes ARP cache corruption, and the CC3000 cannot figure out how to ARP for any host. This fits the CFOD case where the CC3000 goes radio silent except for ICMP echo, which @wtfuzz said he saw continue for some time.

The idea was that flushing the ARP cache periodically, or on failure detection, could be a work-around to stay out of (or get out of) the mute condition. I’m afraid I never see CFOD myself, so I really can’t contribute very much.
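For what it’s worth, a minimal sketch of that work-around on the Core side might look like the following, assuming the TI host driver’s netapp_arp_flush() is reachable from the application build; connection_looks_dead() is a hypothetical placeholder for whatever failure detection you already have:

```cpp
// Rough sketch only: flush the CC3000 ARP cache periodically or when a
// failure is detected, inside a Spark application's loop().
// Assumes the TI host driver's netapp_arp_flush() is exposed by the build;
// connection_looks_dead() is a hypothetical stand-in for real failure detection.
#include "netapp.h"        // CC3000 host driver header (core-common-lib)

extern bool connection_looks_dead(void);   // hypothetical placeholder

void loop()
{
    static unsigned long lastFlush = 0;

    // Flush every 60 s, or immediately when the link looks dead.
    if (connection_looks_dead() || (millis() - lastFlush > 60000UL)) {
        netapp_arp_flush();
        lastFlush = millis();
    }

    // ... rest of the application ...
}
```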

Keep fighting the good fight!

I’m at an all-time record of 12 hours of uptime with aucARP=0 and still going…

That’s great, @wtfuzz! I’m sorry; I see now that I should have read all three pages of the TI support conversation first, where they explicitly asked @AndyW to never call ARP flush and to force the ARP cache to never time out. Even then, the :spark: team did not appear to get perfect results. I understand they are having a better private conversation now.

In theory, the ARP cache should be able to be flushed at any time, and entries should be able to age out at any time; it is just a cache. As the TI guys said, this can lead to reordering in the TX stream, since some TX packets might have to wait for their destination’s ARP entry to reload before transmitting, but the order of packets out of any implementation can differ for lots of reasons. In normal use of the :spark: core, though, I would think the core is only talking to the gateway and not to peers on the local subnet, so there is really only one entry that counts.
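For reference, the aucARP=0 that @wtfuzz mentioned above is the ARP aging timeout handed to the host driver’s netapp_timeout_values() call. A minimal sketch of setting it, with the other three values as illustrative assumptions rather than Spark’s actual defaults:

```cpp
// Sketch: tell the CC3000 never to age out ARP entries (aucARP = 0).
// netapp_timeout_values() is the stock TI host driver call; the other
// three values below are illustrative assumptions, not Spark defaults.
#include "netapp.h"        // CC3000 host driver header (core-common-lib)

void disableArpAging()
{
    unsigned long aucDHCP       = 14400;  // DHCP lease renew, seconds
    unsigned long aucARP        = 0;      // 0 = ARP entries never time out
    unsigned long aucKeepalive  = 10;     // TCP keepalive, seconds
    unsigned long aucInactivity = 0;      // 0 = no inactivity timeout

    netapp_timeout_values(&aucDHCP, &aucARP, &aucKeepalive, &aucInactivity);
}
```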

What is the most stable firmware at this point? How do I flash it on Windows 7?

We just got off a conf. call with TI. They are zeroing in on the problem - the debug dumps are showing that there is a flurry of ARP activity around the time of the failure. There should be no ARP activity at all in our test cases, so something is obviously going wrong that can trigger a more catastrophic failure.
The root cause is currently not known, but TI and Spark are making progress towards resolution.

In addition, I believe that a recent minor change to the cloud software (it now sends one packet, where before it was sending two in quick succession) can help improve the life expectancy before CFOD. This is obviously not a fix, but if people are seeing the MTTF improve, that is one possible cause.

More updates as we get them.


Great news @AndyW!

If anyone who is experiencing lock-ups with CFOD has an easy way to monitor uptime of some polled sensor data, I would be interested to see if my proposal for watchdog timer fixes helps you:

Note this code is basically the latest Core-Firmware Master with my changes, so any other CFOD fixes are not present. I think it's a fair test, and any additional CFOD fixes should make things more stable. The watchdog stuff just keeps the random weird bugs from completely crippling your application.
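For anyone wondering what the watchdog approach boils down to on the STM32, here is the generic pattern using the standard peripheral library’s IWDG calls. This is only a sketch of the technique, not the actual patch in the linked branch, and the prescaler/reload values are assumptions:

```cpp
// Generic STM32F10x independent watchdog (IWDG) pattern -- a sketch of
// the technique only, not the actual changes in the linked branch.
// With the ~40 kHz LSI clock, prescaler 256 and max reload give roughly
// a 26 second timeout (values here are assumptions; tune to taste).
#include "stm32f10x_iwdg.h"

void watchdogInit()
{
    IWDG_WriteAccessCmd(IWDG_WriteAccess_Enable);
    IWDG_SetPrescaler(IWDG_Prescaler_256);   // LSI / 256 ≈ 156 Hz
    IWDG_SetReload(0x0FFF);                  // max 12-bit reload value
    IWDG_ReloadCounter();
    IWDG_Enable();                           // cannot be stopped once started
}

void watchdogFeed()
{
    // Call this from the main loop while the application is healthy; if
    // the loop (or the CC3000 glue code) hangs, the counter expires and
    // the MCU resets instead of sitting in CFOD forever.
    IWDG_ReloadCounter();
}
```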

@david_s5
I ran your binary on our office network, and soon enough the Core entered a stage of perpetual green flashing and never got back online. Are these binaries generated from the latest master branch on your repo?

I have pasted the log here: http://pastebin.com/zFi2vCMz

Please let me know if that helps.

As I told you earlier, @BDub, I have 7 Cores running your modified watchdog and they are all above 24 hours of uptime and ticking. Looking very promising for my intended use: a sensor that collects data, plus a Node server that polls it at a set frequency and saves it to Parse.


@mohit

The failure mode looks different; it looks like the spark protocol was connected. The log does not look like it matches the ‘0_inact_20_sec_recovery_log_core-firmware.zip’ bin.

Please confirm it was http://nscdg.com/spark/0_inact_20_sec_recovery_log_core-firmware.zip

I am waiting on @zachary to do a merge with the timer tick changes. Then I will merge and push the core-firmware code.

@david_s5
To be sure, I re-flashed the Core and saved the logs again. The logs look different, but the behavior is the same: there was one CFOD recovery followed by perpetual green flashing.

http://pastebin.com/ABETchNE

Thanks!

@david_s5 @mohit @BDub Not sure if this is helpful, but I get the Spark Core lockup even when it shows the breathing cyan.

My WiFi lost its connection to the net, which caused my main loop to freeze, and the core must have frozen as well because it kept showing the normal breathing cyan.

I’m about to install @BDub’s watchdog timer as soon as I get a second to figure out how to do it, but I have a feeling it’s going to eliminate these issues by auto-resetting the core.

Video showing this

Here are my thoughts on this problem:

pdf

@RWB @mohit @BDub - I have a major set of changes that stabilize the core. I will be merging and setting up a PR shortly. With it we should always know where the code was running if it hangs. All faults yield a panic code: a red LED SOS followed by the error code. It provides for orderly timeouts on network calls with full logging. Stay tuned.

This link works in my browser


@mohit The logs still look ODD. How are you capturing the data? See the *** below. Is this a clue? Voltage OK?

0000063020:<DEBUG> virtual size_t UDP::write(const uint8_t*, size_t) (133):**(Clo)**
0000063028:<DEBUG> virtual size_t UDP::write(const uint8_t*, size_t) (135):**send9**
0000063038:<DEBUG> virtual size_t UDP::write(const uint8_t*, size_t) (136):**(Clo)**

Thanks for sharing! Hope it helps somehow.

This looks very detailed. Thank you!

I will spend some time to see whether the new SPI driver (https://github.com/spark/core-common-lib) has mitigated all the issues you raised.

Thank you again! @pra

@mohit - All my changes are checked in and merged with the latest spark masters.

Please build and test with https://github.com/spark/core-common-lib et al.

Aloha David

I have an adafruit breakout … I’ll fire it up when I get a chance.

Tonight I wrote a Python script that can be used to force the CC3000 into CFOD and/or unresponsiveness using ARP flooding. This should help with testing, as it speeds up the process of making the CC3000 puke instead of having to wait for it with normal network use. Maybe the TI guys can also use it when fixing the firmware? Anyway, I’ve run it quite a few times; it works 100% of the time and knocks the CC3000 off the network almost immediately (but you can still ping the core). It doesn’t make the LED flash immediately, but the core cannot be accessed via the API; you’ll get a ‘timed out’ error.

If anyone has other CC3000 modules besides the Spark Core, such as the Adafruit breakout board, I’d like to hear whether this script affects them.

gist for cc3000flood.py
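The gist above is the actual tool; purely to illustrate what the flood does at the packet level, here is a rough C-style equivalent (Linux AF_PACKET raw socket, needs root; the interface name and packet count are placeholders, and the real cc3000flood.py may differ in detail):

```cpp
// Rough illustration of the ARP-flood idea, not the linked Python gist:
// broadcast a burst of ARP requests with random sender/target IPs so the
// CC3000's ARP handling gets hammered.  Linux only (AF_PACKET), run as
// root; interface name and packet count are placeholder assumptions.
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";      // placeholder
    int count          = (argc > 2) ? atoi(argv[2]) : 10000; // placeholder

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
    if (fd < 0) { perror("socket"); return 1; }

    // Look up the interface index and MAC address.
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) { perror("SIOCGIFINDEX"); return 1; }
    int ifindex = ifr.ifr_ifindex;
    if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) { perror("SIOCGIFHWADDR"); return 1; }
    unsigned char src_mac[6];
    memcpy(src_mac, ifr.ifr_hwaddr.sa_data, 6);

    struct sockaddr_ll dst;
    memset(&dst, 0, sizeof(dst));
    dst.sll_family  = AF_PACKET;
    dst.sll_ifindex = ifindex;
    dst.sll_halen   = 6;
    memset(dst.sll_addr, 0xff, 6);                        // broadcast

    // Hand-built Ethernet header (14 bytes) + ARP request (28 bytes).
    unsigned char frame[42];
    memset(frame, 0xff, 6);                               // dst MAC: broadcast
    memcpy(frame + 6, src_mac, 6);                        // src MAC
    frame[12] = 0x08; frame[13] = 0x06;                   // EtherType: ARP
    unsigned char *arp = frame + 14;
    arp[0] = 0x00; arp[1] = 0x01;                         // HW type: Ethernet
    arp[2] = 0x08; arp[3] = 0x00;                         // proto: IPv4
    arp[4] = 6;    arp[5] = 4;                            // HW / proto lengths
    arp[6] = 0x00; arp[7] = 0x01;                         // opcode: request
    memcpy(arp + 8, src_mac, 6);                          // sender MAC
    memset(arp + 18, 0x00, 6);                            // target MAC: unknown

    for (int i = 0; i < count; i++) {
        for (int b = 0; b < 4; b++) {
            arp[14 + b] = rand() & 0xff;                  // random sender IP
            arp[24 + b] = rand() & 0xff;                  // random target IP
        }
        if (sendto(fd, frame, sizeof(frame), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");
    }
    close(fd);
    return 0;
}
```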
