Can the Spark Core trigger the Reset Pin?

I should be clear… the code always gets past the IWDG setup process, and starts the slower D7 blink for 5 seconds, but if the Reload value is too high, when it gets to the fast D7 blink stage it just blinks there forever instead of the watchdog timer counting down to zero. I don’t see why this should happen… this is by far the most complicated watchdog timer I’ve encountered… and by the looks of it it’s really not that complicated.

OK SUCCESS!

So my main goal in putting all of the code in setup() was to keep any background tasks from mucking with things while we weren’t looking. As I’m testing this I’m staring at the breathing cyan LED while the Core is in the hard loop waiting to watchdog timeout… and it occurs to me “how the hell is that still breathing cyan in this hard loop??”… hah, interrupts!

So I added a __disable_irq(); at the beginning of setup() and started to have problems. Turns out delay() works off of millis() and millis() works off of interrupts. So I tried delayMicroseconds() and micros() and thankfully they worked!

Good news out of this is that now the Reload register can be set to 0x0FFF and it works perfectly!

With a prescaler of 64, I get a max time of about 6.6 seconds with my stopwatch (should be 6.55 nominal … IF the LSI clock was perfect, so I’d say pretty damn good). With a prescaler of 256, I get a max time of about 26.8 seconds (should be 26.2).
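For anyone who wants to check those nominal numbers, here is a rough host-side sketch (plain C++, not Core firmware). It assumes the simple model timeout ≈ reload × prescaler / LSI with a nominal 40 kHz LSI, which reproduces the 6.552 s and 26.208 s figures; the exact hardware count may differ by a tick. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Nominal LSI frequency assumed here; the real oscillator varies from part to part.
const double kNominalLsiHz = 40000.0;

// Approximate IWDG timeout in seconds under the simplified model
// timeout = reload * prescaler / f_LSI (hypothetical helper, not a firmware API).
inline double iwdgTimeoutSeconds(uint32_t prescaler, uint32_t reload,
                                 double lsiHz = kNominalLsiHz) {
    assert(lsiHz > 0.0);
    return static_cast<double>(reload) * prescaler / lsiHz;
}

// Works backwards from a stopwatch measurement to the LSI frequency it implies.
inline double impliedLsiHz(uint32_t prescaler, uint32_t reload,
                           double measuredSeconds) {
    assert(measuredSeconds > 0.0);
    return static_cast<double>(reload) * prescaler / measuredSeconds;
}
```

Feeding the measured 26.8 seconds into impliedLsiHz(256, 0x0FFF, 26.8) gives an LSI of roughly 39.1 kHz, so the stopwatch result is consistent with an oscillator running a couple of percent slow.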

So now it’s just a matter of figuring out what the interrupts are doing… my suspicion is that they are reloading the IWDG counter after about 2 seconds. So if we try to watchdog time out before that, it works… but if we try any time after that… we are allowing some other code to reset the watchdog. I searched for IWDG_ReloadCounter(); and it’s not being used anywhere but our code and main.cpp, but I have it commented out there. So there are other low level ways … gotta search for those tomorrow. Bed time!!

Here’s the code so far :slight_smile: Keep in mind once IRQ is disabled the RGB can be off or on… just pay attention to the D7 led.

#include <application.h>

uint32_t lastReset = 0; // last known reset time
bool s = true;

void setup()
{
  __disable_irq(); // no funny business!

  // Watchdog setup
  IWDG_WriteAccessCmd(IWDG_WriteAccess_Enable);
  // check that we flag reset before writing Prescaler
  while (IWDG_GetFlagStatus(IWDG_FLAG_PVU) == SET) {
    // Wait until hardware is ready 
  }
  
  //IWDG_SetPrescaler(IWDG_Prescaler_64);
  IWDG_SetPrescaler(IWDG_Prescaler_256);
  
  // check that we flag reset before writing Reload value
  while (IWDG_GetFlagStatus(IWDG_FLAG_RVU) == SET) {
    // Wait until hardware is ready 
  }
  
  //IWDG_SetReload(0x0FFF); // 0x0FFF = 6.55 seconds (prescaler of 64)
  IWDG_SetReload(0x0FFF); // 0x0FFF = 26.2 seconds (prescaler of 256)

  IWDG_WriteAccessCmd(IWDG_WriteAccess_Disable); // lock the IWDG registers again
  IWDG_ReloadCounter(); // kick the dog for the first time
  IWDG_Enable(); // should probably not be needed since we are setting the flags above and then disabling the write access but it doesnt hurt...   

  pinMode(D7, OUTPUT);
  
  lastReset = micros(); // We just powered up    
  while( (micros()-lastReset) < (5*1000*1000) ) {
    // Blink the LED for 5 seconds so we 
    // know the code is running...
    digitalWrite(D7,s);
    s = !s; // toggle the state
    delayMicroseconds(100000); // makes it blinky
    IWDG_ReloadCounter(); // reloads the counter and kicks the dog forward
  }
  while(true) { // kill it!
    digitalWrite(D7,s);
    s = !s; // toggle the state
    delayMicroseconds(30000); // makes it blink faster!!! 
  }
}

void loop() {
  // do nothing, never going to get here anyway!
}

Gotta Love Good News! :smile:

Ok I tracked this thing all over through the code and landed back in main.cpp, lol…

So the question was, how is the reload register getting reloaded via interrupts being enabled?

Answer is this interrupt handler calls a function in main.cpp every 1ms:
https://github.com/spark/core-firmware/blob/master/src/stm32_it.cpp#L162-L174

Which contains this code at the end of it:
https://github.com/spark/core-firmware/blob/master/src/main.cpp#L349-L363

The core firmware originally tries to set the Reload value to 3 seconds, and this routine resets it to 3 seconds after 1 second has elapsed. So it’s continually decrementing from 3 seconds left to 2 seconds left, then repeating. A good safe margin away from 0 seconds left and resetting the micro via watchdog.

So we have all of the tools now… just need to re-write main.cpp a bit to get the functionality we want in user code for the watchdog.

Honestly, I think it’s a really bad idea to keep your watchdog alive via interrupt service routine! It’s basically guaranteed to never fail!

The watchdog needs to be kicked (reset) in areas of code where execution should be regularly happening, and not because of an interrupt timer.
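To make the point concrete, here is a small host-side simulation (plain C++, nothing Core specific). It is only a sketch under assumed numbers: a 3 second timeout, a kick once per second, and a main loop that hangs at some point while the timer interrupt keeps firing. KickSource and simulateWatchdog are hypothetical names.

```cpp
#include <cassert>

// Where the watchdog kick lives in this toy model.
enum class KickSource { TimerIsr, MainLoop };

// Steps time forward 1 ms at a time. Returns the time (ms) at which the
// watchdog would reset the chip, or -1 if it never fires within horizonMs.
// The main loop is considered hung from hangAtMs onward; the timer ISR
// is assumed to keep running regardless.
long simulateWatchdog(KickSource source, long hangAtMs, long timeoutMs,
                      long kickPeriodMs, long horizonMs) {
    assert(timeoutMs > 0 && kickPeriodMs > 0);
    long remainingMs = timeoutMs;
    for (long t = 1; t <= horizonMs; ++t) {
        --remainingMs;
        if (remainingMs <= 0) return t;  // watchdog expired

        bool mainAlive = t < hangAtMs;
        bool kickDue = (t % kickPeriodMs) == 0;
        bool canKick = (source == KickSource::TimerIsr) || mainAlive;
        if (kickDue && canKick) remainingMs = timeoutMs;  // kick the dog
    }
    return -1;
}
```

With the kick in the ISR the hang is never detected, while the same hang with the kick in the main loop leads to a reset one timeout after the last successful kick.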

Also the watchdog needs to be enabled all of the time… and it appears to be when DFU_BUILD_ENABLE is defined, which definitely happens in the local build environment and from the web IDE (aka Sparkulator), but that happens in the makefile. Seems like an ok place to define it, but requires you to #undef it in code if you need to for debugging. Is that the proper way? @Dave

Hmm, this might be a good place for @zachary or @satishgn to jump in, since they designed the firmware layout

Yeah, we just normally comment out the #define when debugging, so undef is fine. Point taken on not keeping it alive via IRQ.

Great thread guys. Nice work…

Got my cores last week. Was using api.spark.io to get the variables I wanted, but between the core itself hitting CFOD and the cloud just getting twitchy a little too often, I had to give up polling the api to get my data.

Modified my core to send its data via UDP to another process on my network (thinking just flicking a UDP packet every second should be more reliable and robust than going out to the cloud every 5 seconds).

You guessed it, woke up to a CFOD this morning.

So I was thinking about how to reset the core on a frequent basis to keep it going… which straight away led me to this thread.

After reading this thread a few times, I’m just not sure what is the best option.
Can someone please sum it up, do we have a software reset chunk of code we can put in our code, or should I look at a 555 to do the reset every few hours?

Really need to get this working, have invested too much time and money in my 5 cores for this to not work.

Thanks.

I’m leaning towards waiting a little while longer to see if they get a reliable reset working without any external equipment needed. That’s the ideal solution.

If you don’t want to wait then just use the 555 timer or ATTiny 85 and you will have the fix early.

My core has been sending data to Xively for 15 hours straight now with the stock Spark Core. With a 555 timer resetting it regularly it would never go down.

@zachary Ok cool, hopefully that didn't sound harsh in bold with exclamation... didn't mean it to be :wink:

I'm going to take a stab :hocho: at this:

...in addition try to figure out a new spot to reload the IWDG counter in main.cpp.

Then maybe some example code to show how to detect an IWDG reset ( and as long as you're detecting it, might as well log it and change the breathing color back to cyan ).

Also some example code to shorten the IWDG timeout value when user code runs, just in case someone needs to start recovering from a lockup faster than 26.2 seconds (this is of course at the expense of requiring a faster loop() execution to kick the dog :dog:, or show how the dog can be kicked in user code with longer loops). ... but still keeping in mind the longest loop execution is 10 seconds, or the Cloud :cloud: turns gray and starts raining :sweat_drops: cheeseburgers large enough to crush :house_with_garden: mobile homes. Mmmm, cheeeeeseburger :hamburger:

I’ll buy ya a Cheezeburger once you get this all wrapped up! :sunny:

Ok I have a working version of:

  1. a longer IWDG timeout of 26.208 seconds (this is the max)
  2. IWDG does not prevent the user code from running.
  3. if IWDG has occurred and we are back on the Cloud, the RGB breathes red.
  4. If we want to detect IWDG reset in user code, just poll IWDG_SYSTEM_RESET and if we reset it to 0, the red breathing will change back to cyan breathing.
  5. I have a new location for the reload code, at the end of the main loop (just after user code runs). It’s set to reload every 8 seconds. So the countdown would start at 26.208… get to 18.208ish and reload to 26.208. This is sort of arbitrary, but I made it slightly less than the delay that blocks the Heartbeat code. If anyone has a better idea, share it now before I get to the pull request stage :wink:
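A rough sketch of the "reload every 8 seconds" gate from item 5, written as plain host-side C++ so it can be tried off the Core. ReloadGate and lowestCountdownMs are hypothetical names, not anything in core-firmware; the timestamps stand in for millis(), and the unsigned subtraction is there so the comparison still works when the millisecond counter rolls over.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: tells the main loop when it is time to reload the IWDG.
class ReloadGate {
public:
    explicit ReloadGate(uint32_t intervalMs)
        : intervalMs_(intervalMs), lastReloadMs_(0) {
        assert(intervalMs > 0);
    }

    // Returns true (and remembers the time) once at least intervalMs has
    // elapsed since the last reload. Unsigned wraparound keeps this valid
    // across a rollover of the millisecond counter.
    bool shouldReload(uint32_t nowMs) {
        if (static_cast<uint32_t>(nowMs - lastReloadMs_) >= intervalMs_) {
            lastReloadMs_ = nowMs;
            return true;
        }
        return false;
    }

private:
    uint32_t intervalMs_;
    uint32_t lastReloadMs_;
};

// Lowest value the countdown reaches between reloads, in milliseconds.
inline uint32_t lowestCountdownMs(uint32_t timeoutMs, uint32_t intervalMs) {
    return timeoutMs > intervalMs ? timeoutMs - intervalMs : 0;
}
```

With a 26208 ms timeout and an 8000 ms interval, lowestCountdownMs returns 18208, which is the "18.208ish" low point mentioned above.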

LAST THING NEED TO FIGURE OUT: How to allow the user to stop user code execution so that they can reflash their core OTA. The thought would be that they notice the breathing red, and assume code is blocking for too long somehow. Worst case scenario is the code blocks right away and does not allow the OTA update. Factory reset would fix this, but not running user code is more elegant and arguably more safe.

SO… I had planned to use the same RESET / MODE button sequence as it takes to get you into DFU mode. So after YELLOW flashing, it would flash MAGENTA, and then WHITE. Obviously you all know by now WHITE is the factory reset. So if you let go of mode during the flashing MAGENTA… this would RESET a variable such as USER_CODE_ENABLED and boot the firmware. Normally USER_CODE_ENABLED would be set, so it’s always going to run, but in this case it won’t… allowing everything else to loop and OTA to take place. Only problem is, that MODE / RESET button code is in the BOOTLOADER :frowning: So while it seems like the best place, I’m a little hesitant to just go modify it because that’s not an easy/safe thing to replace in everyone’s Cores.

I also thought about working it into the normal MODE button press while your core is running… the one that gets you into listening mode. But it’s not really possible to make that super clear without changing the other modes slightly… for example. We could put it at the very end of the time… hold mode for first 3 seconds and let go, you are in slow blue flash listening mode… hold for 10 seconds and let go and it will clear wifi profiles… KEEP holding past all that though and it would start flashing MAGENTA. Let go and your code stops running. However… there is a catch-22 here. If your code blocks right away, there is no way to get out of your code even after we set USER_CODE_ENABLED to 0. It almost has to happen from boot :-/ So I would appreciate any kind of feedback on how we could do this.

I also thought… keep track of how many times the Core rebooted from watchdog reset, and if it happens so many times in a row at the fastest rate (26.208 seconds) we know the user code is blocking immediately and is a problem so on the next reset disable user code. However, this involves NVM memory, so it’s starting to get complicated. Also, I really like the idea of continuing to reset under lock ups to make sure things like CFOD can be cleared. If we stop just so the user can reflash OTA, well then their core may be sitting there connected to the Cloud with nothing going on after a few CFOD’s.
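The reboot-counting idea could look roughly like the sketch below. It is host-side C++ and purely illustrative: BootState, shouldRunUserCode and noteHealthyRun are hypothetical, and the counter would have to live somewhere that survives a reset, which is exactly the non-volatile storage complication mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state that must survive a reset (storage location not modelled here).
struct BootState {
    uint8_t consecutiveWatchdogResets;
};

// Call once at boot. Counts back-to-back watchdog resets and returns
// true while user code should still be allowed to run.
bool shouldRunUserCode(BootState& state, bool resetWasWatchdog,
                       uint8_t maxConsecutive) {
    assert(maxConsecutive > 0);
    if (resetWasWatchdog) {
        if (state.consecutiveWatchdogResets < 255) {
            ++state.consecutiveWatchdogResets;
        }
    } else {
        state.consecutiveWatchdogResets = 0;  // normal boot clears the streak
    }
    return state.consecutiveWatchdogResets < maxConsecutive;
}

// Call once user code has survived longer than one watchdog period,
// so an occasional lockup hours later does not count as "blocking immediately".
void noteHealthyRun(BootState& state) {
    state.consecutiveWatchdogResets = 0;
}
```

The trade-off raised above still applies: once the limit is hit the Core stops running user code, so an intermittent CFOD could leave it connected but idle.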

Another thing we COULD do is just always run user code no matter what… and advise if the RED breathing is seen… watch the Core for 30 seconds… if it resets during that time, it’s most likely resetting due to bad user code that is locking up. If not, it may have happened due to CFOD or some other intermittent thing in the user code.

Ideas? @zachary @satishgn @sjunnesson @RWB

I’m trying to keep up with you guys when it comes to understanding what is going on here. Most of it is beyond me at this point of my understanding of microcontrollers.

Here is what I get from what I just read @BDub

  1. You have the built in Watch Dog Reset function working on the ARM32 chip now. Great News.

  2. You have set up the Watch Dog Reset function to reset the Spark Core if it has not been kicked in 26.208 seconds. Does this reset the ARM 32 or just the CC3000? Or does the whole board go through a reset?

  3. To use the Watch Dog Reset Function we will just need to add the Kick The Dog code to the bottom of our main loop which will kick the Watch Dog every 8 seconds. If the Watch Dog is not kicked within 26 seconds the core resets and saves the end user from having to do a manual reset.

  4. After the Watch Dog Reset has been triggered and we have successfully reconnected to the Cloud, the LED will breathe RED to let us know the Watch Dog Reset has happened. I think this is good so we know what has happened.

This reset is going to happen to everybody eventually and I would hate for people to see their cores breathing Red as if it was always in the error state. I think a better way to indicate the Watch Dog Reset has occurred is to program it to breathe RED but have that revert back to breathing Cyan after 6, 12, or 24 hours has passed so it does not breathe RED forever.

So to accomplish this it sounds like you can poll IWDG_SYSTEM_RESET and if it is = 1 then we can have code to switch that back to 0 after 6, 12, or 24 hours which will bring the normal breathing Cyan back.
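As a sketch of that idea in plain host-side C++: the struct below stands in for the IWDG_SYSTEM_RESET flag plus a chosen hold time, and updateIndicator would be called periodically with the uptime in milliseconds. All names here are hypothetical and this is not how the firmware branch actually drives the LED.

```cpp
#include <cassert>
#include <cstdint>

enum class BreatheColor { Cyan, Red };

// Hypothetical stand-in for the watchdog-reset flag and how long to show it.
struct WatchdogIndicator {
    bool resetSeen;         // set at boot if the last reset came from the IWDG
    uint32_t clearAfterMs;  // e.g. 6 hours = 21600000 ms
};

// Clears the flag once the hold time has passed and returns the color
// the LED should breathe right now.
BreatheColor updateIndicator(WatchdogIndicator& ind, uint32_t uptimeMs) {
    assert(ind.clearAfterMs > 0);
    if (ind.resetSeen && uptimeMs >= ind.clearAfterMs) {
        ind.resetSeen = false;
    }
    return ind.resetSeen ? BreatheColor::Red : BreatheColor::Cyan;
}
```

Even 24 hours (86400000 ms) fits comfortably in a 32-bit millisecond counter, so no rollover handling is needed for these hold times.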


The above sounds perfect!

Now about this code blocking an OTA Firmware/Sketch Update due to the Watch Dog resetting the Spark Core from not kicking the dog while the update process is happening, which takes longer than 26 seconds. That is indeed a downside if it prevents this normal process but I am totally OK with it.

I’m personally totally fine having to do a factory reset every time I need to upload new code to the Spark Core considering it’s going to provide me the stable Spark Core that I have been dreaming about. I’m sure somebody will have a better solution which will allow the current normal OTA code updates, but for now the factory reset will be just fine with me since it’s only something that happens while initially setting up a project. I’m sure others will have other solutions for this.

Good work BDub! Sounds promising!

this sounds like great progress.

Some comments and thoughts.

  • Is there a reason to just kick the dog every 8 seconds? Can’t we do it each loop of the user code?
  • The OTA is a hairy one. Ideally you would like the first OTA command to disable the watchdog and then once it is finished start it up again. Maybe that is a good general behavior for the Cores to make sure that the user hasn’t enabled the watchdog anyway. Thoughts @zachary?
  • Having to rely on the timing of the MODE button presses gets tough quick. So easy to press too long or too short, and since I believe the watchdog would reset the Core every now and then, having to long press 10+ seconds to reset it would be an annoying extra step. Can we maybe do a click pattern? Like three quick presses on the MODE?

btw I just got this message from the Spark Forum when trying to type an answer. Maybe time to step back in this thread a bit...

"Let others join the conversation
This topic is clearly important to you – you've posted more than 22% of the replies here.
Are you sure you're providing adequate time for other people to share their points of view, too?"

Since people keep mentioning me I just want to chime in. I’m seeing all these comments, reading, and appreciating the discussion. These issues are deeply intertwined, difficult to untangle, and require serious thought to hit upon the best user experience. It’s tough to come up with a one-size-fits-all solution; the best one can do is to pick sensible defaults.

The fundamental choices to be made here are these:

  • How should we kick the dog? That is, what is the official signal that everything’s OK and code is still running vs. needing to be reset?
  • How long do we wait before deciding that something is wrong?
  • What should we do if something is wrong? Should we run the user code if it failed last time?
  • How long should the user code be allowed to block the rest of the Core firmware?

I think this last point is not clear to many beginners—that an Arduino only runs your code (so there’s no problem blocking forever) whereas the Core runs a lot of other code. The two “threads” have to play nice, sharing processor time, memory, SPI lines, network sockets and bandwidth, timers, interrupts, DMA, serial lines, and lots of other resources. The Right Way™ is for both :spark: code and user code to avoid blocking at all, but of course we don’t want to frustrate people who copy and paste code that works fine on an Arduino.

We chose one set of answers to these questions, but we can change it anytime someone hits upon a more elegant overall user experience.

Keep up the good discussion folks!

Also, to respond to specific ideas—I agree we should change the behavior after a watchdog reset. It no longer makes sense to silently not run the user’s code. A breathing red LED seems like a good signal something’s wrong. I’m not sure about whether it’s better to run the user code while the red LED breathes or to not run it. Maybe try running it with an orange breathing LED the first time, and if we fail again, breathe red and don’t run the user code.

Thoughts?

As a newbie to Arduino & Spark Core all I want for now is a workable solution to keep Spark Core online with code that runs fine without the need for an external micro chip to do the work of the built in Watch Dog function.

A more elegant solution can be worked up over time but for now if the internal watchdog circuit can be used successfully then please let’s move forward with it since it will keep everybody from having to manually monitor and reset their core from freezing.

This solution does not have to be rolled out and pushed to everybody since it’s not perfect right now.

If people’s code runs fine on the spark core and the only issue is the CC3000 causing the core to lock up then @BDub’s work sounds like a working solution for many people.

It sounds like the temporary solution guide to using BDub’s Watch Dog Timer trick would go as follows:

  1. Get your factory fresh Spark Core working correctly with what ever code you want to run.

  2. Test your code and make sure the only issue you’re having is the WiFi freezing up and stopping your code from running, which requires a manual reset.

  3. Automate this resetting process by implementing BDub’s Watch Dog timer.

  4. If you see the Breathing Red LED then you know the Watch Dog did its job in the last 6 - 12 hours and everything is working just fine still.

  5. If you want to modify your code after implementing BDub’s Watch Dog timer then just do a Factory Reset, and load your new code. Then apply BDub’s Watch Dog Timer again once you’re happy with your newest code.

Is the above something that is doable to help those of us who are frequently having to manually reset the core?

I know this is not ideal for everybody but it’s ideal for a lot of us right now.

It will reset the STM32 which should reinitialize the CC3000. There is no RESET line on the CC3000 so the only way to hard reset that is with a power cycle. So far I have seen the watchdog timer code work during CFOD though, after which the Core connects to the Cloud... so re-initializing the CC3000 seems to work in lieu of a hard reset.

No, I meant it's automatically kicked in the main MAIN loop, found in main.cpp. It's the main loop that basically does this:

while(1) {
  runBackgroundTasks();
  if(online) {
    runUserSetupOnce();
    runUserLoop();
    
    // proposing to put watchdog reload here
    if(8 seconds elapsed) kickTheDog();
  }
}

Now to answer the question of why 8 seconds? I was initially thinking it would save time to not reload it every time we came out of the user code. Basically shortening the background task time before we get back to the user code again. However it appears the reload is really just one command, wrapped in a function call... which is not really much time difference compared to a counter comparison. I think we should just kick the dog each time we leave the user code, but before the background tasks are run again. This gives PLENTY of time to connect to the WLAN or CLOUD (26.208 seconds).

The only thing about resetting the ORANGE/RED breathing state back to CYAN after a fixed uptime period of 6, 12, 24 hours is... if you come back and look at your core after 25 hours, you won't know that it ever had an issue. I agree that breathing cyan is the best "look", but somehow you need to permanently throw some kind of status that things went bad. You can always clear this in your user code after you log the condition, but if you're not logging it... then you need to know about it. Perhaps there could be a different way to view the IWDG_RESET though... like breathe cyan but mix in a RED blip when the cyan fades out completely. It would be subtle, just enough to get the point across.

Only problem with NOT having the IWDG enabled, is if your code (any code) locks up before you enable it... you are toast. So that's why one of the first things you enable is typically the watchdog. And if your user code locks up immediately on entry, you'll never loop through the background tasks enough (as in not at all) to catch the OTA update.

Basically the mode button doesn't do anything until you hold it down for 3 seconds currently... so a shorter press and release of say between 0.250 - 2.99 seconds could reset the USER_CODE_ENABLED flag, and make the RGB breathe magenta to indicate it's connected to the cloud, but NOT running your user code. Magenta is associated with flashing user code already so it's a good tie in color, and breathing tends to indicate connected to cloud.
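Here is roughly how that press-length decision could be sketched in host-side C++. The 250 ms, 3 second and 10 second thresholds come from the numbers in this thread; ModeAction, ModeThresholds and classifyModePress are hypothetical names, and the real button handling in the firmware is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// What a MODE button release would trigger in this proposal.
enum class ModeAction { None, DisableUserCode, ListeningMode, ClearWifiProfiles };

// Hold-time thresholds in milliseconds (assumed values taken from the thread).
struct ModeThresholds {
    uint32_t shortMs;   // shortest press that counts at all
    uint32_t listenMs;  // hold this long for listening mode
    uint32_t clearMs;   // hold this long to clear WiFi profiles
};

// Maps how long the button was held to the action taken on release.
ModeAction classifyModePress(
        uint32_t heldMs,
        const ModeThresholds& t = ModeThresholds{250, 3000, 10000}) {
    assert(t.shortMs < t.listenMs && t.listenMs < t.clearMs);
    if (heldMs >= t.clearMs) return ModeAction::ClearWifiProfiles;
    if (heldMs >= t.listenMs) return ModeAction::ListeningMode;
    if (heldMs >= t.shortMs) return ModeAction::DisableUserCode;
    return ModeAction::None;  // too short, treated as a bounce or accident
}
```

Keeping the new action in the short-press window means the existing 3 second and 10 second behaviors stay exactly where they are.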

Since there is a very simple loop as depicted above, I think it makes sense to kick the dog every time through that process. But only when it's ONLINE because of issues like CFOD, or WLAN not connecting. Once CFOD is a vague memory, we could move the kicking of the dog to just before the runBackgroundTasks(); but if CFOD is not an issue anymore WLAN not connecting still could be... so perhaps in the ONLINE part of the code is the best place overall. If you disable SPARK_WLAN_ENABLE, the IWDG still functions to serve as a watchdog against user code locking up. And technically CAN be disabled in user code if need be. I think there MAY be an issue with SmartConfig here since that takes a while to work, and is not considered ONLINE at that point. I have an old timing diagram that seems to indicate SmartConfig is its own tight loop. Anything like that obviously needs its own kicking of the dog within it.
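A toy model of "kick only when ONLINE", as host-side C++. Each pass through the main loop is described by how long it took and whether the Core was online at the end of it; the dog is only kicked on online passes, so offline time accumulates toward the timeout. Pass and firstPassThatTrips are hypothetical names and the durations in the usage are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One trip through the main loop (background tasks plus user loop).
struct Pass {
    uint32_t durationMs;  // how long the whole pass took
    bool online;          // was the Core online at the end of the pass?
};

// Returns the index of the first pass during which the watchdog would
// fire, or -1 if it never does. The dog is kicked only after online passes.
int firstPassThatTrips(const std::vector<Pass>& passes, uint32_t timeoutMs) {
    assert(timeoutMs > 0);
    uint32_t sinceKickMs = 0;
    for (std::size_t i = 0; i < passes.size(); ++i) {
        sinceKickMs += passes[i].durationMs;
        if (sinceKickMs >= timeoutMs) return static_cast<int>(i);
        if (passes[i].online) sinceKickMs = 0;  // kick the dog
    }
    return -1;
}
```

Three 10 second passes in a row while offline trip a 26208 ms watchdog on the third pass, whereas the same three passes while online never do. That is the behavior wanted for a WLAN that refuses to reconnect.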

How long to wait is a good question... but right now in hardware with the IWDG the longest delay is 26.208 seconds. If we wanted to wait longer than that, I can't currently think of a good way to add a foolproof software counter that extended it. It would be susceptible to lockups. 26.2 seconds seems like PLENTY of time to connect to your WLAN, or the CLOUD. If not, it gets reset and it can try again for 26.2 seconds.

It's kind of hard to know WHAT failed when you reset from IWDG... unless we are constantly writing the state of where the code is to non-volatile memory as it's looping... but I think we would wear out the memory pretty fast. Even if we knew it was USER code that locked up... why should we automatically prevent it from running again? Maybe it just locks up intermittently once in a while... say, on the hour exactly because of some counter wrapping or a bad compare to some time variables. If we just reset and run the code again, the user's code will run for a whole hour before it locks up again... maybe providing them with precious sensor data. We can latch an indication on the RGB that IWDG has occurred, which should help a user to realize there is a problem potentially with their code, or network. If there was an easy way to check uptime, a user could see the IWDG indication and send a request like: https://api.spark.io/v1/devices/?access_token=xxxxx to check uptime to see when it reset last. Uptime is basically the millis(); counter, and could easily be implemented in user code as a variable as well, but wastes one of your available variables.

User code should be able to block up to the IWDG timeout value of course :wink: I understand though that if user code blocks for more than 10-15 seconds the Core will drop off the Cloud. So why wait longer than that? To allow plenty of time to get the WLAN and CLOUD connected in the first place.

[quote="zachary, post:56, topic:2693"]
A breathing red LED seems like a good signal something's wrong. I'm not sure about whether it's better to run the user code while the red LED breathes or to not run it. Maybe try running it with an orange breathing LED the first time, and if we fail again, breathe red and don't run the user code.
[/quote]

I've experimented with orange on the RGB and it just looks yellow and or red. It's hard to tell it's CLEARLY orange if you never stared at all of the other colors. Perhaps the red blip idea weaved into the cyan breathing would be best? Could even blip once, twice, thrice... for number of reset times. Anything over 3 is going to start being too many blips to count, so you can just assume it's reset a lot. I do think we need to run the user code until the user decides not to... pretend everyone is designing a Black Rocket... mission critical stuff.
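The blip-count cap is simple enough to sketch in a couple of lines of host-side C++. redBlipCount is a hypothetical helper; how the blips actually get mixed into the cyan breathing is not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// Number of red blips to show per breathing cycle: one per watchdog
// reset, capped so the pattern stays countable at a glance.
inline uint8_t redBlipCount(uint32_t watchdogResets, uint8_t maxBlips = 3) {
    assert(maxBlips > 0);
    return watchdogResets > maxBlips ? maxBlips
                                     : static_cast<uint8_t>(watchdogResets);
}
```

Zero resets gives zero blips, so a healthy Core would keep breathing plain cyan.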

I think the watchdog code only works properly if it's always running... i.e. the core-firmware sets it up... and user code can augment it (make it time out faster or disable it completely) ... but you should not have to figure out how to setup the watchdog in your "arduino-like" code. You have better things to worry about! :smile: Understanding the codes is key though to knowing what's working and what's not working.

I like the red LED blip idea, with it blipping up to 3 times max to indicate reset times.

If it’s easy to implement then make a quick button press on the Spark Core clear this Watch Dog error message.

Keep up the Excellent work, I feel like we’re almost there and on to other exciting things LOL :smiley:

@BDub

How do you think your Watchdog would have handled this situation where the main loop is still running but the CC3000 will not reconnect to the WiFi network once it shows up again after being gone for 90 mins?

Here is a video showing the issue. I see it often when I leave with my phone/Hotspot and then come back about an hour later.

Usually the main loop will stop running when this happens and the blue LED will not flash indicating a successful hand off to Xively. And possibly the main loop would have stopped running eventually which would have triggered the Watchdog circuit.

Watched your video… With my watchdog code, your Spark Core would have reset itself in 26.208 seconds (assuming perfect 40kHz LSI clock) after not being connected to the WLAN, and it would have reconnected to your network :wink:

Please do give my test code a try, it’s current with the Core-Firmware and Core-Common-Lib as of right now 2/11 but you’ll need to grab Spark’s Core-Communications-Lib before you build:

I’ve seen this reboot two cases of “CFOD on power up” as well so this is promising.

https://github.com/technobly/core-firmware/tree/watchdog_timer_fix
https://github.com/technobly/core-common-lib/tree/watchdog_timer_fix

Check out the application.cpp for examples of how to detect and clear IWDG status, and an example of it resetting itself due to a hard loop in the user code.

I still would like to figure out why just resetting the IWDG_SYSTEM_RESET does not restore the CYAN breathing, you have to force it back with LED_SetRGBColor(RGB_COLOR_CYAN); I need to understand how the Timing_Decrement() function in main.cpp works better I guess. Obviously the code I put in there updates the RGB from CYAN to RED, but won’t go back… so does it only get run once somehow?

Also I’ll need to figure out a good place to put the RED blip code, and see how that will look and function.

And finally intercepting the Mode button to lock-out user code. Might have to get creative with this one. I think it will work without having to use non-volatile memory or resetting the Core.