Bug bounty: OTA flash reliability issue

Hi all,

We’ve recently uncovered a bug in OTA flash that causes OTA flashes to perform inconsistently; sometimes the Core freezes up right when the flash starts. We’ve got a stability fix for the CC3000 (to dramatically improve the issue often referred to as “Cyan Flash of Death”, or CFOD), but we can’t distribute it until this issue is fixed.

We’re going to be focusing our own resources on this ASAP, but if anyone from the community wants to jump in, we’ll offer a $100 bug bounty for any pull requests that lead to the resolution of the issue. This $100 will come in the form of credit on the Spark store (i.e. Spark Cores and accessories).

The issue is documented here: https://github.com/spark/core-firmware/issues/233

2 Likes

Cut & paste from en email I just sent:

[I] predict that OTA flashing turns out to be an excellent test case for the problem described here in the forums:

I have not yet dug into exactly what mutex is required around the SPI CS assertion, but there is enough smoke surrounding this issue to expect fire in the same area.

I actually don’t understand what the CC3000 driver would be doing in an irq context, since the driver seems fully polled; but if the flash driver uses interrupts, that will be enough to cause problems.

I fully expect that if you were to hang an SPI analyser off the bus when this problem happens you will see mangled SPI transactions that are resulting in the send() return value of zero.

1 Like

@AndyW, did you have a board with the SPI exposed?

I can wire up the same way @mohit did the other time for CFOD debugging but sadly, I don’t yet own a logic analyser.

Just a basic analyser for me

The problem as I see it is that the irq handler from the CC3000 is asserting it’s CS line on an interrupt. There’s no check if other devices (external flash) are using the spi bus concurrently. So if other devices are using it, toggling MISO etc… then the CC3000 will try to send data over the bus, causing data corruption.

Is there a test we can do to verify the presence/absence of the bug?

I don’t seem to have any OTA issues, so I might code a stress test to read from the net and read/write flash as the same time to see if I get corruption.

1 Like

It’s not quite that simple, but it is certainly not provably correct either.

The thing is that the CC3000 will only interrupt as a result of chip select being asserted, and the current host driver basicly blocks spinning on variables, so there is no forward progress in the foreground task while first the CC3000 /INT signal is handled (it is simply a handshake to say the SPI bus is ready, not a normal asynchronous interrupt), then the DMA completion interrupt is handled at the end of the SPI transfer.

The external flash code is vanilla polled pure-foreground code.

So while there is no obvious way for the code to do anything other than run one or other foreground code path to completion, there is zero protection against different/illegal behavior.

My personal suspicion is that the lack of a global mutex around the entire CC3000 and flash SPI accesses may allow other events (e.g. triggered by timer ticks etc) to interject themselves mid-transaction.

My current plan is to add simple spin-lock mutex code around the two competing SPI transactions, at the whole transaction level, and see if that remedies the problem. However, first I need to create a reliable test-bed to reproduce the OTA failure in captivity, and I won’t be able to start that for 10+ hours.

In the meantime, if someone can post a scriptable, simple, linux-based test that triggers this problem - it would be a great help to all working this issue.

EDITED TO ADD:

I stand corrected, there are completely unsolicited events (all housekeeping in nature, nothing useful like “hey, there’s data on this here socket”), which are signaled by the CC3000 pulling IRQ low, so these are perfectly capable of clashing with an ongoing flash operation (erase, read, write, verify etc) - effects may also involve unpredictable/bad behaviour from the CC3000, because the flash SPI access will be active when the CC3000 CS foes active…

1 Like

could it be something related to the position of 1’s and 0’s in the firmware.bin being transferred? i had an issue with particular firmware that wouldn’t flash ever, but adding an extra space in one serial print made it work… removing the same space made it fail again…

1 Like

What was the root cause of that ?

I would regard it as an epic fail if :spark: had an unreliable OTA mechanism, which had to be massaged by adding useless {padding,spaces,unicorn tears} to make it work again.

This is where we discussed it, first started happening about a month ago. the files are still on git now (backed up when it happened), just re-reading it i removed (commented/uncommented) the whole serial.print at line 105 to make it work/not work.. but either way something like that shouldnt make it be flash-able or not-flash-able

and this guy could possibly have the same or similar issue

If the Spark receives an unsolicited event from the CC3000 while doing a flash operation, the CS for the flash will be turned off and CS for the CC3000 turned on. This would arbitrarily terminate the flash operation midstream. The most likeliest unsolicited event to receive is a keep alive. You can change the event_mask in the spark to not include keep alive notifications. You can also turn keep alives off/on in NodeJS at the cloud. Try socket.setKeepAlive(false) prior to starting the OTA and see whether that helps.

The other possibility is buffer release notifications, which can’t be masked out, but they only occur after transmissions, and the Spark should be receiving not transmitting during OTA downloads. Unless you are sending back application level acks as part of the OTA protocol. If you are then you should consider changing the protocol. TCP will ensure what is sent is received.

Just looked at the WLAN code and keep alines are masked off, so they won’t be a problem. Have there been changes to the OTA protocol? If an ack is being sent back to the cloud, delaying the transmission of the ack until after the write to flash would probably suffice to overcome the problem

1 Like

@zach Ok, between line 965 and line 1042 of spark_protocol.ccp, there are acks transmitted before flash callbacks are utilized to perform flash operations. They are all commented as //Send Ack. One adds 2.04 to that comment. These ack blocks need to move to after the flash begin, update and finish callbacks so that the the unsolicited buffer release notifications don’t interfere with the flash operations.

I can’t give you a pull request because I don’t use this part of the spark firmware.

1 Like

Agreed - for the cases where the CC3000 is producing unsolicited messages, the CC3000 leads with the IRQ.

Disabling the CC3000 interrupt pin on the STM32 during the flash access code (with appropriate spl() [showing my age there] code) would prevent the CC3000 ISR from preempting the flash operation, and possibly mangling the CC3000 into the bargain.

Still looking for a good scriptable test case to use.

Disabling the interrupt pin under any circumstance with be bad news I feel. The CC3000 is very resource limited and very dependent on the Spark to clear out what it has in house. Slowing that service down will not help. May prove disastrous, as I imagine flash operations are pretty long winded (ms instead of us). Even delaying the buffer release notifications could result in ack transmission errors as the available buffer counts are not current.

Well then, rock meet hard place.

Mutex means someone’s going to have to wait, and the async nature of the CC3000 IRQ line in certain situations means that servicing that event will have to wait, unless you can predict the future (and code said time machine for an ARM without using up too much RAM.)

We can minimise the additional latency by locking at the finest granularity possible, but it remains to be seen how messy that gets.

1 Like

0r you can just move the acks to after the flash operations which should solve the problem without creating any new ones. Rule of thumb is don’t do TCP/UDP transmissions before flash operations. How can you use a mutex in an ISR anyway,

But we have no control over when buffer updates from the CC3000 are issued (to cite one counter-example.)

Well you do. They occur after a buffer is used by the CC3000 on a send or sendto and has been freed after the transmission has been completed (and I assume acked by the recipient). If you don’t transmit anything there will be no notification. What you don’t know is how long after a send will the notification arrive as that is totally network dependent.

In reality - having watched and decoded the SPI traffic, I would say the correlation is weak at the very best.

another way of addressing it is to set a flag in send and sendto that signifies that a buffer release notification is pending. You clear that flag in the event handler, which is running under a DMA interrupt, when the notification is processed. The Flash_CS_Low function in hw_config tests the flag and waits until the flag is clear. If it has control and the flag is clear all is good.

I still don’t think holding off the flash operations will help us here - that will be potentially a long delay and bear in mind that the current :spark: firmware is basicly single threaded, so you will spin there waiting for the CC3000 to free a buffer (which might take a while because of internet events beyond your control), meanwhile the :spark: cannot execute anything else useful.

I don’t see a free lunch here, we need mutex of some sort, we can haggle about where the endpoints of that are - they will all have different trade-offs.