Bug bounty: Kill the 'Cyan flash of death'

RWB · February 6, 2014, 8:24pm

tdicola - Great to see you here providing input. I see you on Adafruit’s Show and Tell almost every Friday was actually wondering if you had any of these issues on your Adafruit CC3000 projects. It looks like this CC3000 issue is wide spread and not isolated to the Spark Core its self.

Looking forward to seeing what you end up doing with the Spark Core.

zach · February 6, 2014, 11:25pm

Hey all - thanks for all the activity here and the help debugging. Thanks in particular to @david_s5 for the deep debugging and the pull request you’ve already provided; looking forward to incorporating your improvements.

In another thread, someone commented that the bug bounty we originally offered here is no longer really sufficient, which I fully agree with. This issue goes much deeper than we initially thought it would. Therefore we’re raising the bug bounty to $500 of Spark stuff for a fix, and any pull requests with improvements in reliability will get $100 of Spark stuff. I’ll change the first post to incorporate these changes.

Don’t know what you would do with $500 of Spark stuff? Happy to send Spark Cores directly to a bunch of your friends. It’s the least we can do to thank you all for the input and support you’ve provided here.

david_s5 · February 7, 2014, 1:36am

Thank you @zach

Here is another set of improvements. That has recovery code in it to detect the CFOD and does a reconnect in about 20 second total. So if the CFOD only happen once and while, this will recover fast.

http://nscdg.com/spark/0_inact_20_recovery_core-firmware.zip

It still sends a udp packet every 60 seconds to 10.10.0.1 and does the same reading from the http get.

All the log data is critical, because the faults are logged after being detected and mitigated so they do not sent the CPU into a hardfault. Please send the logs and LED observations. (if it dies in a red ...---... Blink N ...---...)

1 (Faults,RGB_COLOR_RED,HardFault) 1
2 (Faults,RGB_COLOR_RED,NMIFault)
3 (Faults,RGB_COLOR_RED,MemManage)
4 (Faults,RGB_COLOR_RED,BusFault)
5 (Faults,RGB_COLOR_RED,UsageFault)

6 (Cloud,RGB_COLOR_RED,InvalidLenth)

7 (System,RGB_COLOR_RED,Exit)
8 (System,RGB_COLOR_RED,OutOfHeap)
9 (System,RGB_COLOR_RED,SPIOverRun)

10 (Softare,RGB_COLOR_RED,AssertionFailure)
11 (Softare,RGB_COLOR_RED,InvalidCase)

ravenet · February 7, 2014, 7:29am

Assuming the temp workaround already pushing out when we re-flash our cores, mine seems to be worse now than previous. It used to be that it would die sometime overnight, but now it’s almost immediate. Sometimes it will recover only to die again seconds later.
My little test app is nothing more than a check temp, flash led, wait a second, flash another led, wait a second, rinse and repeat.

wtfuzz · February 8, 2014, 8:57pm

Another interesting observation. I added ICMP pings to the gateway address before attempting to connect, and it looks like the CC3000 was able to send packets (ARP cache had the MAC of the GW), however it was not responding to ARP who-has requests from the GW.

09:05:30.717080 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
09:05:30.719853 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:05:31.717048 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 44032, length 16
09:05:31.719853 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:05:32.719849 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:05:50.716599 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
09:05:50.719854 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:05:51.717048 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 44032, length 16
09:05:51.719852 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:05:52.719847 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:10.721560 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
09:06:10.723854 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:11.716093 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 44032, length 16
09:06:11.723852 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:12.723850 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:30.717560 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
09:06:30.719853 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:31.715619 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 44032, length 16
09:06:31.719847 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:32.719846 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:50.723033 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
09:06:50.723851 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:51.718373 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 44032, length 16
09:06:51.723849 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
09:06:52.723828 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28

RWB · February 8, 2014, 9:03pm

So I can attempt to run this improved firmware and provide feedback. Please link me to the best instructions or video on how to update firmware on a Windows 8 64bit if possible.

david_s5 · February 8, 2014, 10:25pm

Have a look at https://github.com/spark/core-firmware#4-flash-it

david_s5 · February 8, 2014, 10:55pm

Updated:

Use http://nscdg.com/spark/0_inact_20_sec_recovery_log_core-firmware.zip

It ran for me for over 48 hours with no hard faults!

It will Log CFOD;s as

0000164364:<ERROR> void SPARK_WLAN_Loop() (582):Resetting CC3000 due to 2 failed connect attempts

david_s5:

Here is another set of improvements. That has recovery code in it to detect the CFOD and does a reconnect in about 20 second total. So if the CFOD only happen once and while, this will recover fast.

use like above

It still sends a udp packet every 60 seconds to 10.10.0.1 and does the same reading from the http get.

All the log data is critical, because the faults are logged after being detected and mitigated so they do not sent the CPU into a hardfault. Please send the logs and LED observations. (if it dies in a red ...---... Blink N ...---...)

1 (Faults,RGB_COLOR_RED,HardFault) 12 (Faults,RGB_COLOR_RED,NMIFault)3 (Faults,RGB_COLOR_RED,MemManage)4 (Faults,RGB_COLOR_RED,BusFault)5 (Faults,RGB_COLOR_RED,UsageFault)

6 (Cloud,RGB_COLOR_RED,InvalidLenth)

7 (System,RGB_COLOR_RED,Exit)8 (System,RGB_COLOR_RED,OutOfHeap)9 (System,RGB_COLOR_RED,SPIOverRun)

10 (Softare,RGB_COLOR_RED,AssertionFailure)11 (Softare,RGB_COLOR_RED,InvalidCase)

RWB · February 9, 2014, 12:33am

From my testing the Loop stops running once the CC3000 locks up. I’m not sure if a Pin High will go low once the Spark Core freezes due to CC3000 issues.

See this thread where we tried to reset the Spark Core every 10 mins. The loop quit running when the CC3000 froze up though the reset never happens.

Sparky · February 9, 2014, 1:51am

Figured it out! Thanks RWB. I have created a temporary solution that resets the Spark only if the program stops running from being disconnected to the cloud. My CFOD Killer using an ATTiny85. https://community.spark.io/t/cyan-flash-of-death-killer/2713

sjunnesson · February 9, 2014, 10:09pm

@zach and @zachary is there any updates on the conversation with TI regarding this bug? The TI thread seems to be dead for quite some time now.

mohit · February 9, 2014, 10:21pm

We had a conference call with TI and their tech support team this Friday. They asked us to generate more debug logs for them to decipher. Earlier logs did not yield anything useful. @AndyW has been diligently following up with them with new debug reports. We explained exactly how the CC3000 behaves under CFOD (the module seems okay on its SPI side but its radio goes silent)

The logs include simultaneous capture of the SPI traffic, TI’s debug lines and pcap files. This week we are hoping to hear back from them with something substantial.

bko · February 9, 2014, 10:56pm

I want to start off by saying that I really appreciate all the work that both the team and the die-hards in community are doing to help debug this. It is a really collaborative, team effort to fix a complicated bug.

I have always wondered why busy networks see the CFOD with much more frequency that non-busy networks. I have only seen it once or perhaps twice, but my cores are on a private AP with almost no other traffic now.

I thought that @wtfuzz had an interesting log above. The core answers ICMP echo (ping) requests but not ARP who-has requests. This is a funny state to be in, but one I have seen before in other implementations when the local ARP cache is hosed and you no longer have the MAC and IP address for the gateway cached. Ping can continue to work because you can just “turn-around” the addresses in the packet you receive, which lots of implementations do for speed. Reading through the driver code, I see there is a netapp_arp_flush(); function available.

Has anyone experimented with flushing the ARP cache periodically to see how that might effect CFOD?

Also, can anyone confirm that you can ping a core in the throes of CFOD? That would be a big clue.

Just a random thought. Good luck with the hunt!

wtfuzz · February 10, 2014, 1:07am

The core is actually sending ICMP in this case (core IP is 10.78.100.176), but refusing to answer to the ARP from 10.78.100.1, which is trying to send an echo response when the MAC has expired from its (Linux router) table.

I have been able to ping the core for a short period after CFOD.

I've also seen SYN packets go out to the cloud, a SYN+ACK comes back, but no ACK is sent back from the CC3000.

Very strange behavior I'll probably stick an arp flush in and see what happens.

wtfuzz · February 10, 2014, 1:11am

ARP Flush added. Let’s see what happens…

0000001542:<DEBUG> void WLAN_Async_Callback(long int, char*, unsigned char) (311):HCI_EVNT_WLAN_UNSOL_CONNECT
0000002502:<DEBUG> void WLAN_Async_Callback(long int, char*, unsigned char) (347):HCI_EVNT_WLAN_UNSOL_DHCP
0000002603:<DEBUG> void SPARK_WLAN_Loop() (618):Flushing ARP cache
0000002609:<DEBUG> void SPARK_WLAN_Loop() (620):ARP flush done
0000002615:<DEBUG> void print_address(const char*, const unsigned char*) (96):IP: 10.78.100.176
0000002623:<DEBUG> void print_address(const char*, const unsigned char*) (96):Netmask: 255.255.255.0
0000002632:<DEBUG> void print_address(const char*, const unsigned char*) (96):Gateway: 10.78.100.1
0000002641:<DEBUG> void print_address(const char*, const unsigned char*) (96):DHCP Server: 10.78.100.1
0000002650:<DEBUG> void print_address(const char*, const unsigned char*) (96):DNS Server: 10.78.100.1
0000002659:<DEBUG> void print_ipconfig() (115):MAC: 08:00:28:58:e0:8e
0000002665:<DEBUG> void print_ipconfig() (118):SSID: Fuzz
0000002670:<DEBUG> void SPARK_WLAN_Loop() (626):ICMP
0000002675:<DEBUG> void SPARK_WLAN_Loop() (634):ICMP done 0
0000002681:<DEBUG> int Spark_Connect() (620):SparkSocket 1
0000002686:<DEBUG> int Spark_Connect() (640):Connecting
0000002754:<DEBUG> int Spark_Connect() (642):Connect returned 0
0000002760:<DEBUG> void SPARK_WLAN_Loop() (698):HANDSHAKE

wtfuzz · February 10, 2014, 1:15am

Under working conditions, I’m seeing ARPs quite often from my linux box actually… I may try some packet generation as well to see what I can break

18:55:56.120416 ARP, Request who-has 10.78.100.1 tell 10.78.100.176, length 46
18:55:56.120425 ARP, Reply 10.78.100.1 is-at 00:0b:e0:f0:00:ed, length 28
18:55:56.123729 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [S], seq 248064626, win 1460, options [mss 1460], length 0
18:55:56.177638 IP 54.208.229.4.5683 > 10.78.100.176.1709: Flags [S.], seq 1696023395, ack 248064627, win 17922, options [mss 8961], length 0
18:55:56.180841 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [.], ack 1, win 1460, length 0
18:55:56.234727 IP 54.208.229.4.5683 > 10.78.100.176.1709: Flags [P.], seq 1:41, ack 1, win 17922, length 40
18:55:56.238831 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [.], ack 41, win 1460, length 0
18:55:56.312492 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [P.], seq 1:257, ack 41, win 1460, length 256
18:55:56.366400 IP 54.208.229.4.5683 > 10.78.100.176.1709: Flags [.], ack 257, win 18760, length 0
18:55:56.388865 IP 54.208.229.4.5683 > 10.78.100.176.1709: Flags [P.], seq 41:425, ack 257, win 18760, length 384
18:55:56.392941 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [.], ack 425, win 1460, length 0
18:55:56.698207 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [P.], seq 257:275, ack 425, win 1460, length 18
18:55:56.752075 IP 54.208.229.4.5683 > 10.78.100.176.1709: Flags [P.], seq 425:443, ack 275, win 18760, length 18
18:55:56.755399 IP 10.78.100.176.1709 > 54.208.229.4.5683: Flags [.], ack 443, win 1460, length 0
18:55:57.104538 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
18:55:57.104571 IP 10.78.100.1 > 10.78.100.176: ICMP echo reply, id 52480, seq 43776, length 16
18:55:58.105719 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 44032, length 16
18:55:58.105745 IP 10.78.100.1 > 10.78.100.176: ICMP echo reply, id 52480, seq 44032, length 16
18:56:01.175848 ARP, Request who-has 10.78.100.176 tell 10.78.100.1, length 28
18:56:01.191378 ARP, Reply 10.78.100.176 is-at 08:00:28:58:e0:8e, length 46

bko · February 10, 2014, 2:11am

If flushing the ARP cache has a good effect, I wonder if we can add static ARP table entries? In the old days when things like IP cameras first came out, you often had to force a static ARP table entry for them since they didn’t support ARP at all.

Thanks @wtfuzz for trying out my crazy idea!

wtfuzz · February 10, 2014, 2:27am

@bko no way I can see in the TI API to control the ARP cache. Just flush it

Interestingly, I’ve seen Spark_Connect() fail, Internet_Test() fail, it flushed the ARP cache and the next one succeeded. I wouldn’t get too excited yet though

0004397881:<DEBUG> int Spark_Connect() (642):Connect returned -1
0004397886:<DEBUG> void SPARK_WLAN_Loop() (670):Internet_Test
0004397892:<DEBUG> int Internet_Test() (604):Connecting socket 0
0004400780:<DEBUG> void WLAN_Async_Callback(long int, char*, unsigned char) (373):HCI_EVNT_WLAN_KEEPALIVE
0004400780:<DEBUG> void WLAN_Async_Callback(long int, char*, unsigned char) (374):Expecting opcode 0x1007
0004410763:<DEBUG> int Internet_Test() (606):Result -1
0004410768:<DEBUG> void SPARK_WLAN_Loop() (682):nvmem_write
0004410889:<DEBUG> void SPARK_WLAN_Loop() (618):Flushing ARP cache
0004410895:<DEBUG> void SPARK_WLAN_Loop() (620):ARP flush done
0004410901:<DEBUG> void print_address(const char*, const unsigned char*) (96):IP: 10.78.100.176
0004410909:<DEBUG> void print_address(const char*, const unsigned char*) (96):Netmask: 255.255.255.0
0004410918:<DEBUG> void print_address(const char*, const unsigned char*) (96):Gateway: 10.78.100.1
0004410927:<DEBUG> void print_address(const char*, const unsigned char*) (96):DHCP Server: 10.78.100.1
0004410936:<DEBUG> void print_address(const char*, const unsigned char*) (96):DNS Server: 10.78.100.1
0004410945:<DEBUG> void print_ipconfig() (115):MAC: 08:00:28:58:e0:8e
0004410951:<DEBUG> void print_ipconfig() (118):SSID: Fuzz
0004410956:<DEBUG> void SPARK_WLAN_Loop() (626):ICMP
0004410961:<DEBUG> void SPARK_WLAN_Loop() (634):ICMP done 0
0004412978:<DEBUG> int Spark_Connect() (620):SparkSocket 0
0004412983:<DEBUG> int Spark_Connect() (640):Connecting
0004413047:<DEBUG> int Spark_Connect() (642):Connect returned 0
0004413052:<DEBUG> void SPARK_WLAN_Loop() (698):HANDSHAKE
0004413589:<DEBUG> void SPARK_WLAN_Loop() (700):Result: 0

And we can see that the CC3000 did an ARP request after the flush

20:11:25.407296 ARP, Request who-has 10.78.100.1 tell 10.78.100.176, length 46
20:11:25.407309 ARP, Reply 10.78.100.1 is-at 00:0b:e0:f0:00:ed, length 28
20:11:26.406944 IP 10.78.100.176 > 10.78.100.1: ICMP echo request, id 52480, seq 43776, length 16
20:11:26.406987 IP 10.78.100.1 > 10.78.100.176: ICMP echo reply, id 52480, seq 43776, length 16
20:11:26.443948 IP 10.78.100.176.1874 > 54.208.229.4.5683: Flags [S], seq 1284853186, win 1460, options [mss 1460], length 0
20:11:26.499372 IP 54.208.229.4.5683 > 10.78.100.176.1874: Flags [S.], seq 3173016529, ack 1284853187, win 17922, options [mss 8961], length 0
20:11:26.503090 IP 10.78.100.176.1874 > 54.208.229.4.5683: Flags [.], ack 1, win 1460, length 0

bko · February 10, 2014, 2:53am

I didn't see a way to control the ARP cache either, other than flush.

And I am not too excited: Best-case, this might be part of a much larger work-around and pointer to TI to look at this area. It would seem from reading everyone's findings here that there are plenty of bugs to go around.

It is just that this is the first thing that makes some sense of the fact that busy networks CFOD more than lightly loaded networks. Busy networks have more ARP traffic and more chances for ARP problems.

AndyW · February 11, 2014, 2:33am

There is definitely some sensitivity around ARPs in the CC3000.

One of the first steps they requested was modifying the ARP timeout to 0 (aucARP = 0 in core-firmware/src/spark_wlan.cpp), which is supposed to never age an entry out of the ARP cache.

However, even with that change made, situations arise where the CC3000 ARPs for the default gateway.
In my testing these events do not correlate with the CFOD, but they do correlate with other badness such as http://e2e.ti.com/support/low_power_rf/f/851/t/318100.aspx

Topic		Replies	Views
Sparkcore lost connection after some time connected Troubleshooting	10	1738	June 10, 2014
Flashing Cyan after 5-10min Troubleshooting	6	2975	January 13, 2014
Simple LED flashing program bombs out after an hour or two Troubleshooting	4	1767	January 10, 2014
Dropping the connection to spark cloud Troubleshooting	94	12333	February 21, 2016
Core losing connection after few hours Troubleshooting	2	1813	February 7, 2014

Bug bounty: Kill the 'Cyan flash of death'

Related Topics