Bug bounty: Kill the 'Cyan flash of death'

OK, the max IWDG timeout is 26.208 seconds, with the prescaler set to 256 and the reload value set to 0x0FFF.
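For reference, here's a minimal sketch of that configuration using the STM32 Standard Peripheral Library (it assumes the nominal 40 kHz LSI clock, which in practice drifts quite a bit, and the function names are just illustrative):

```cpp
#include "stm32f10x_iwdg.h"

// Longest possible IWDG timeout:
// reload * prescaler / LSI = 4095 * 256 / 40000 Hz ~ 26.208 s (nominal LSI)
void watchdog_start_max(void)
{
    IWDG_WriteAccessCmd(IWDG_WriteAccess_Enable); // unlock the PR/RLR registers
    IWDG_SetPrescaler(IWDG_Prescaler_256);        // LSI / 256
    IWDG_SetReload(0x0FFF);                       // maximum 12-bit reload value
    IWDG_ReloadCounter();                         // start from a full count
    IWDG_Enable();                                // once enabled, only a reset stops it
}

// "Kick" the dog: must be called before the ~26 s window expires.
void watchdog_kick(void)
{
    IWDG_ReloadCounter();
}
```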

I think the IWDG timeout should be set higher than all of the other SW timeouts, but only if those software timeouts result in calling NVIC_SystemReset(); that is the only thing besides the IWDG that can get the system out of a stuck state.

So if you are telling me that all of the SW timeouts you have implemented are set to about 20 seconds, I think that works out well. However, it only gives the Core 20 seconds to connect to WLAN and/or connect to the CLOUD and/or perform the HANDSHAKE… I guess this seems like enough time… and if not, a reset gives it 20 more seconds! lol.

The other case to question is SmartConfig, which can take a long time to complete… so maybe the watchdog would need to be kicked in that loop a couple of times. It's a special mode anyway, and as long as that BLUE LED is blinking… the interrupts are still working and a counter can be running to ultimately time out after something like 2 minutes? If the SmartConfig loop locked up for some reason, a solid blue LED or blank LED would likely indicate to the user that they need to reset, but the IWDG would also kick in under those conditions.
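Something like this rough sketch is what I have in mind; SmartConfig_InProgress() is just a stand-in for whatever the real SmartConfig loop tests, and the 2-minute figure is the one mentioned above:

```cpp
extern bool SmartConfig_InProgress(void);   // hypothetical stand-in for the real loop condition

// Kick the IWDG inside the SmartConfig wait loop, but still bail out after
// ~2 minutes so a wedged loop cannot hold the Core hostage forever.
static const uint32_t SMART_CONFIG_TIMEOUT_MS = 2UL * 60UL * 1000UL;

void wait_for_smart_config(void)
{
    uint32_t start = millis();
    while (SmartConfig_InProgress())
    {
        IWDG_ReloadCounter();                 // keep the watchdog happy while we wait
        if (millis() - start > SMART_CONFIG_TIMEOUT_MS)
        {
            NVIC_SystemReset();               // give up after ~2 minutes
        }
    }
}
```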

I think we should let the user code run up to 26.208 seconds… unless they want to kick the watchdog AND kick the WLAN loop from their code, in which case they can run as long as they want… I think.

Correct - I had a typo in the subnet, so it looks like the CC3000 was ignoring the ARP responses.

Fixed that, and the script does cause problems, but it behaves differently when directed at two different Cores:

  1. A Core connected to the production Spark servers: the script will either be benign or cause CFOD. Looking at the tcpdumps, it is either completely ignored, or it causes the CC3000 failure behind the CFOD.

  2. A Core connected to Spark's staging server: the script will cause the deadly embrace the first time it is run; after that, the effect is benign. I have yet to drive this Core to CFOD using this script (although it does hit CFOD eventually of its own accord).

Both Cores are running identical firmware (which does not have the restart code, because I need the Cores to stop on CFOD, not recover by restarting).

I'm trying to wrap my head around this behaviour, but it is confounding.

I have provided the Python script and this information to TI, and will report back with their response.

I think we can work with that, but even 3 seconds will work... see below.

The only other timeout (in my fork) has to do with timing the flash, and it is the only one that calls NVIC_SystemReset(). I need to understand what it does and see if it is an appropriate solution. (It may be a dead wait in the serial flash driver for the st25; I fixed the NuttX version of that driver, so I will take a look when time permits.)

Here is an idea. If we accept 3 assertions: 1) the WD lives in the common-lib; 2) system code (Spark's rewritten CC3000 and st25 drivers) is vetted; 3) if cloud connected, the SparkProtocol has to run once every 30 seconds.

Then:

User code should not run more than 15 seconds (with no network/flash ops) without calling checkin().

The watchdog gets kicked in main's call to loop(), in the new hci_event_handler while timing, and in the st25 driver (if needed).

Also consider that the Arduino paradigm needs a slight shift to run on a "Fractionally Connected Device". This can be a very elegant and yet simple solution if done right, and will have minimal impact on the setup()/loop() paradigm.

Non-issue because of the way the new hci_event_handler works.

I feel Spark's WLAN SmartConfig/Connect UI makes it hard to understand what is going on. For another project I have already designed and implemented a better SmartConfig with a user feedback system for the WLAN connectivity. So we can look at the two and see if we cannot improve things.

Agreed, but let's use the notion of a checkin API, checkin({start|disable|enable|done}, [expected to run X ms]), and keep the real IWDG abstracted from the user space (spUS). The sPOs (I need names for the Spark code and user code) will handle the rest.
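A strawman of what that surface could look like (every name and parameter here is up for debate, nothing is implemented):

```cpp
#include <stdint.h>

// Strawman checkin() API: the real IWDG stays hidden inside the system (sPOs)
// code; user space (spUS) only declares what it is about to do.
enum CheckinOp { CHECKIN_START, CHECKIN_ENABLE, CHECKIN_DISABLE, CHECKIN_DONE };

// expected_ms: optional hint of how long the upcoming section of user code is
// expected to run; the system code decides whether that fits under the IWDG
// period or needs its own software deadline.
void checkin(CheckinOp op, uint32_t expected_ms = 0);

// Example user-space usage:
//   checkin(CHECKIN_START, 10000);   // "about to crunch for ~10 s"
//   crunch_numbers();
//   checkin(CHECKIN_DONE);           // back under normal supervision
```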

Just my thoughts before coffee....

I don't see where you explain how this will be ok.

This is a problem... if something can't connect to the WLAN for 12 hours... I think we should have tried a reset at some point way sooner.

My hope is that very soon all Spark code is vetted and working properly, but until then you need a safety net. Something that without a doubt kicks the Core in the pants and gets it working again. Otherwise we will continue to have more and more threads popping up with people saying they are going back to just coding on Arduino because it just works and stays working.

I guess I don't see how, if user code is limited to 15 seconds, the IWDG can be reloaded every 3 seconds... how?

I put it just outside main's call to loop(), so that it also gets kicked after setup() runs once. setup() can block for quite a while if you do things that wait for user input to open a serial connection, e.g. while(!Serial.available()); — I like to use this to help lock-step-open my serial terminals.
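Roughly this shape, heavily simplified (the real main loop also services the WLAN and Spark protocol between passes):

```cpp
extern void setup(void);   // user code
extern void loop(void);    // user code

int main(void)
{
    // ... system init, IWDG started with the long (~26 s) timeout ...
    setup();                    // may block for a while, e.g. while(!Serial.available());

    while (1)
    {
        IWDG_ReloadCounter();   // kicked once per pass, just outside loop()
        loop();                 // user code must return within the IWDG window
        // ... WLAN / Spark protocol servicing happens around here too ...
    }
}
```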

I understand how the main() loop executes... but can you explain the timing and sequence of "the new hci_event_handler while timing, and in the st25 driver"?

I don't know what you mean by this last part, "Fractionally Connected Device"... I agree the IWDG needs to be kicked in and around the setup()/loop() in case you have the WLAN disabled.

Are you agreeing to 26.208 seconds now? What about 15 seconds and 3 seconds... :smile: As for the API, that's fine as long as it can't be mishandled by the user too easily.

System integrity is complicated, and it is compounded by a device that can sleep and is communicating over a network without 100% uptime.

Sorry for not painting a complete picture.

Let's plan on developing a UML state diagram of the operating states expected of the device, then overlay the IWDG usage on it.

3 seconds assumes that the user code is using the facilities of the sPOs, not doing a while(1);.

There are 2 levels of "cannot connect": Spark Protocol and user code.

Spark Protocol will reset the CC3000 if it cannot connect. This is now built in once all the PRs are merged to master.

User code should have the option of calling System.Reset() if it deems its target is unreachable. Or we can add an observer for the network code that can do it.

If a HW WD is not implemented correctly, it will guarantee frustration and failure of a whole other realm.
At a minimum, a WD failure needs to capture and log the cause and decide if a reset is needed.

I am not sure if the STM32 has the ability to RESET only, or can also vector to an ISR on WD timeout (a la the MSP430). But the same thing can be done with an ISR off a timer and NVIC_SystemReset(), with the IWDG as a backup.

This is just an example of one of the "facilities of the sPOs". I am assuming the proper initialization and use of the WD timer in the code running outside of the bootloader.
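A sketch of that timer-ISR-plus-backup idea, with made-up names: a 1 ms tick enforces the software deadline and can log the cause before calling NVIC_SystemReset(), while the hardware IWDG (set longer) remains armed in case even the ISRs stop running:

```cpp
#include "stm32f10x.h"   // pulls in CMSIS, provides NVIC_SystemReset()

volatile uint32_t soft_deadline_ms = 0;   // refreshed by the kick below

// Software kick: arm (or re-arm) the deadline.
void soft_watchdog_kick(uint32_t timeout_ms)
{
    soft_deadline_ms = timeout_ms;
}

// Called from a 1 ms timer or SysTick interrupt.
void soft_watchdog_tick(void)
{
    if (soft_deadline_ms != 0 && --soft_deadline_ms == 0)
    {
        // Room here to capture/log the cause (e.g. stash a code in a backup
        // register) before resetting -- something the raw IWDG cannot do.
        NVIC_SystemReset();
    }
}
```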

For the Spark cloud to work, the yield back to loop()'s caller has to occur to allow the Spark protocol to meet its 30-second deadline. During time in loop() we assume 1 of 2 things: 1) user code is using facilities of the sPOs, or 2) the code does checkins.

A device that sleeps, and/or communicates over a communication medium that is not 100% connected.

This is why we need a UML state diagram.

What is the sPOs? I don't think we should make it mandatory for users to have to call checkin() to keep the watchdog reloaded. This is a huge burden for someone who just wants to create an LED that blinks on and off at a 0.1 Hz rate (5 sec on / 5 sec off).
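For concreteness, this is the kind of sketch I mean; nothing in it knows about watchdogs or checkin(), and the pin choice is arbitrary:

```cpp
#include "application.h"   // standard Spark user-app include

// 0.1 Hz blinker: 5 s on, 5 s off. delay() is a facility of the system code,
// so any watchdog kicking would have to happen inside it on the user's behalf.
int led = D0;

void setup() {
    pinMode(led, OUTPUT);
}

void loop() {
    digitalWrite(led, HIGH);
    delay(5000);
    digitalWrite(led, LOW);
    delay(5000);
}
```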

Agreed :smile:

Part of the reason why this all needs to work autonomously in the background is that nobody wants to think about these timing constraints up front. They just want to code something up that works, period. If it doesn't work, they need to be notified why not... so if they block in their code for too long, the system shouldn't just hang and then drop off the cloud like it currently does, reconnect, and sit there idle not running the user code. It should automatically reset itself and alert the user through the RGB LED with some color combination that says "something's different... go figure out what it is!"

I had proposed changing the Cyan breathing to Red breathing that you can clear in user code... but I think a nicer presentation would be just adding a Red blip at the end of the Cyan breathing cycle (when it fades out). It could blink Red once for each IWDG reset that has occurred... maybe up to 3 - 5 times. More than that and you really don't care how many times it's reset... just assume a lot.
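Detecting that the last reset came from the IWDG is cheap on the STM32, so the blip count could be driven by something like this sketch (the counter would need to live in a backup register or a no-init RAM section to survive the reset; that part is hand-waved here):

```cpp
#include "stm32f10x_rcc.h"

static uint8_t iwdg_reset_count;   // hand-waved: must survive the reset somehow

// Call once early at boot, before the flags get cleared by anything else.
void check_reset_cause(void)
{
    if (RCC_GetFlagStatus(RCC_FLAG_IWDGRST) != RESET)
    {
        if (iwdg_reset_count < 5)  // cap at 5: beyond that, just assume "a lot"
            iwdg_reset_count++;
        // iwdg_reset_count then drives how many red blips follow each cyan breath
    }
    else
    {
        iwdg_reset_count = 0;      // a normal power-on/reset clears the tally
    }
    RCC_ClearFlag();               // clear all reset-cause flags for next time
}
```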

Once they go read about why these problems may occur, they can get smarter about how they write their code to avoid the resets... but at least the reset will attempt to get their code up and running again. Maybe their code doesn't block right away... perhaps once every hour there is a bad calculation that results in an infinite loop. This reset would be a great safeguard to keep the system up and running.

The other thing you can't assume is that current and future Core firmware will be perfect. Code can lock up anywhere, even in ISRs. The IWDG is the one and only thing that can get you out of a jam.

Also, I don't know why you would want to make the IWDG call a function before resetting the system; that kind of defeats the purpose of it being a dedicated hardware reset. Like you said, if you need to run some code before reset, you can use ISRs, counters, and NVIC_SystemReset() in the handler.

@BDub @david_s5

Without BDub's current watchdog I'm running, the Spark Core would be worthless to me. If I hadn't figured out how to implement it, I would have thrown the Spark Core to the side and come back months later to see if the problem had been fixed. Instead I just ordered 2 more units.

Even when the firmware improves, I can't ever see the Spark Core being reliable without this watchdog always standing guard. It's allowing my Spark Core to remain connected to the net without any user interaction, even though it's resetting the Core often.

@BDub

If the user is toggling an LED, they are using facilities of the sPOs and do not need to check in.

@RWB got it! I am not arguing against @BDub's WDT… or the use of a HW WDT… just trying to have it do its job without being a burden in ANY use case.

So the Spark build you are running right now has issues. See http://nscdg.com/spark/David_Sidrane’s_rework_of_the_TI_CC3000_driver_for_Spark.pdf

But keep in mind that the master repos have undergone massive reliability re-work, and when all the PRs are merged (next week hopefully), QA-ed, and released, I think you will be pleased with the way the Core behaves when things go wrong. And if not, please fill me in on what does not work and I will try to fix it.

I have finally figured out how to load new firmware on the Spark Core so I can try any new upgrades now if you have them and then report back.

Do you want to build it and run your app, or just run my bin app that does an HTTP GET and a UDP sendto and recvfrom?

Ideally, run the improved firmware with my current sketch that is sending data to Xively constantly. I basically just monitor this thing all day while I do other stuff, plus the Xively graphs show me exactly when a disconnection happens and for how long.

@RWB You can send me your sketch and I will build it, or you can pull the dev tools and build it yourself?

If you need things tested, @david_s5, I'm more than happy to load up 5 Cores or so with code and let them sit for some days. Just point me to a repo and I will build locally and flash the firmware.

Sure!

@sjunnesson

Use GitHub spark/* master.

Merge the following into spark master:

https://github.com/davids5/core-firmware.git branch redundant_timers_take3
https://github.com/davids5/core-common-lib.git branch redundant_timers_take3
https://github.com/davids5/core-common-lib.git branch interrupt_safe_flag_checking

When in NetBeans you are building the bin from source, but unless you pulled from my repos (see above) you are not running the new code. The bin files I sent were your app built on the new code. If you have a hex editor you can edit the credentials, or check out the repos and merge them. Either way, there is no BDub code in there unless you add it.

Yea I can do it if I have the following:

core-firmware
core-common
core-communication

I do not understand how to merge your linked libraries with Spark's master libraries. What does that mean? How do I do it?

I'm following these step-by-step instructions on this thread; it worked fine for BDub's WD hack. https://community.spark.io/t/how-to-video-for-compiling-locally-in-windows/2457

@RWB

Have a read of http://labs.kernelconcepts.de/downloads/books/Pro%20Git%20-%20Scott%20Chacon.pdf

@sjunnesson, @RWB and All

To make life simpler for anyone who would like to test my latest versions with all the bug fixes incorporated, I have created an interim branch called spark_master_new in both repositories.


It comprises spark/master with all my PRs that are outstanding against that version of master.

These branches will have my latest changes that differ from the current spark/master for the 2 repos.

core-firmware has PRs 81 and 82:
PR 81 is critical to proper network operation and recovery, and PR 82 fixes bugs in UDP and TCP.

core-common-lib has PRs 14 and 11:
PR 14 is cleanup, and PR 11 is critical for interrupt-safe operations on the socket status.

UPDATED: there were typos in the second URL.

@david_s5 I looked over the Pro Git manual you linked to, but it's a bit too much for me to process right now, LOL.

The only way I know how to load new firmware is by following the example video I linked to above, which requires:

Core-Firmware
Core-Communication
Core-Common

If I can't just download those folders ready to go from Git, then I'm just going to have to wait till it can be applied via the Spark IDE update. It's the branching and merging that just leaves me lost :smile:

If you can provide links to the folders above, then I can follow the steps in the firmware walkthrough video I linked to above. If that's too much trouble, then I have no problem waiting for the new updates to be released to everybody.

Thanks for all your efforts!