Known issue: long delays or blocking code kills the connection to the Cloud

@mrOmatic this might be a real crash due to your code. Try making your minutes and seconds variables float or double rather than long.
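For example, something like this (a sketch, assuming minutes and seconds are derived from millis() as in your code, which isn’t quoted here):

double seconds = millis() / 1000.0; // floating-point seconds since boot
double minutes = seconds / 60.0;    // floating-point minutes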


Hello all.

I got my cores a few days ago and started programming them. I too saw the consistent crashes. Eventually I noticed that it was crashing during a call to client.println, and perhaps Serial.println too.

Since what is being done in the example code above is nearly identical to what I was doing, I thought I would share my “fix” - Don’t use println :slight_smile:
I format the string I wish to print, then use .write with a byte count.
For example, you could change the print calls above to something like this:

char io[80]; // I put this up before setup() to create a global scope character buffer

sprintf(io, "%d:%d\n", (int)minutes, (int)seconds);
Serial.write((const uint8_t *)io, strlen(io));
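
Presumably the same byte-counted approach works for the TCP side (a sketch, reusing the io buffer above and assuming a TCPClient named client as elsewhere in the thread):

sprintf(io, "%d:%d\n", (int)minutes, (int)seconds);
client.write((const uint8_t *)io, strlen(io)); // instead of client.println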


Before I did this my Core was crashing every minute or two. After this change it has been running for nearly four hours.

@zach: If the crashes actually related to the use of long instead of a floating-point type, it would still be a bug, since the outlined code is perfectly acceptable in C - and even more so in C++, which is what your preprocessor seems to produce from the user code (especially since all involved types are integer types - even the constants 60 and 1000, not 60.0 and 1000.0).

Fair enough, although it looks like the issue might actually be a bug in our implementation of Serial.println(). @mrOmatic are you able to get it working with @pilottrack’s fix?

I’m back from a few days of holidays, friends, family, etc.

OK, I’m testing the print theory now by removing the serial prints completely and monitoring via the cloud. It doesn’t match what I’ve observed, but I’m happy to test it out.

I’ve been meaning to post my observations of the core crashing.

When I was doing the uptime testing, I was monitoring via serial and via a script polling the cloud service; additionally, the cloud monitoring script would notify me via Pushover https://pushover.net/ when the cloud returned a 408 status (Request Timeout).

What I found is that the cloud side would fail first: I’d get a couple of notifications of 408 status (monitoring every 10 seconds), I’d go back to my desk, and the Spark LED would still be breathing with the serial monitor still happily scrolling. Then, a few seconds later, the serial monitor would halt and the LED would switch to flashing.

So it always seemed that the Cloud side would fail first and then the user application would crash second.

It’s just an observation; I hope it might be useful.

So just for kicks I had another play with testing uptime when using a 1 second delay.

I stripped out all the print commands and just used this…

int UpTime = 0;

void setup() {
  Spark.variable("UpTime", &UpTime, INT); // expose uptime to the cloud
}

void loop() {
  UpTime = int((millis() / 1000) / 60); // uptime in whole minutes
  delay(1000);
}

It still consistently crashes at fairly random intervals. Interestingly enough, the average for 10 runs of this code was down to 9.9 minutes, where the previous version had a 10-run average of 17.2 minutes.

So that’s enough of that; it looks like delays crash cores, especially since my other core has been running far more complex code doing TCP, MQTT, debouncing buttons, etc. for days without crashing.

So, for clarification, the tests I’ve been doing relate to the ‘Cyan flash of death’ and apparently not the other cloud connection problem from the original post.

A fix has been implemented, but not yet merged into master. Check out this feature branch:

You’ll need to get the context-switching branches of all 3 firmware repositories and build locally, following the instructions in the README. This will be pushed to the web IDE when it’s been fully vetted.

This feature is still in testing, and leaves very little RAM for the user, which is something we’re working on. However if you’re all about the bleeding edge, check it out!


Subscribing to this thread so I can hear about when this patch makes it to the IDE!

It sure was glorious to see my output flying into my console one time though. The nice thing is I know my loop is still running fine, since I hooked up an LED that changes its brightness based on the analog pin’s value.

Even after a factory reset I haven’t been able to see that output again :frowning:

And here is the code that I’m using on my core:

int LED = D1;
int TILT = A0;
int val = 0;

TCPServer server = TCPServer(4444);
TCPClient client;

void setup() {
    Serial.begin(9600); //adding or removing this makes no difference
    pinMode(LED, OUTPUT);
    server.begin();
}

void loop() {
    //get the current value of the tilt sensor
    val = analogRead(TILT);
    
    //use the debug LED to show things are still working
    analogWrite(LED, val/16);
    
    //if any clients are connected, fire them some data
    client = server.available();
    if(client) {
        server.println(val);
    }
    
    //fast fast!
    delay(10);
}

@Endoplasmic, what was your last commit status after the “context-switching” pull?

The recent increase of the WLAN SPI baud rate didn’t work well with context-switching, so we had to revert back to the older settings.

Since this branch and its dependent branches are under continuous development and testing, not all commits will work as expected.

The last known stable build commits (core-firmware and core-common-lib) are these:


Older stable commits:


@Endoplasmic,

Getting TCP and the other network-related libraries to work under the context-switching branch is a work in progress. Presently those libraries don’t work and will lead to HardFault exceptions.

Oh, I didn’t pull it down at all. I wanted to wait for the smoke to clear, which sadly means, for now, letting dust gather on the Spark :frowning:

Wanted to bump this up. Using serial -> USB has been fine for a while, but I want to be able to have my core connected somewhere that isn’t the server (or a computer at all).

Just curious if the TCPServer stuff is getting the love it needs is all :smile:


Hi, can anyone tell me if this is still an issue with the latest firmware, or has it been addressed?

Thanks.

Hi @Raldus,

The current firmware version allows you to call delay() with a large number and still have the cloud connection work. If the delay value you give is larger than 1 second (1000 ms), then the Spark cloud handling code is called during the delay; otherwise it is not.

The cloud handling code is always called when you reach the end of loop() and go round again.

You could still break the cloud connection by having lots and lots of little delays that add up to a big delay without going around the loop, but that is pretty unlikely.


Hi @bko @Raldus, the case where “lots and lots of little delays that add up to a big delay” is also handled in the delay() logic, so it should not be a problem. The check if ((elapsed_millis >= spark_loop_elapsed_millis) || (spark_loop_total_millis >= SPARK_LOOP_DELAY_MILLIS)) handles both conditions: single long delays as well as many short delays adding up to a long delay.
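
For reference, a simplified sketch of that delay() logic (names follow the condition quoted above; the real firmware also kicks the watchdog, handles millis() wrap-around, and skips the cloud call while the WLAN is asleep):

#define SPARK_LOOP_DELAY_MILLIS 1000

static unsigned long spark_loop_total_millis = 0; // accumulates across calls to delay()

void delay(unsigned long ms) {
    unsigned long spark_loop_elapsed_millis = SPARK_LOOP_DELAY_MILLIS;
    spark_loop_total_millis += ms;

    unsigned long last_millis = millis();
    while (1) {
        unsigned long elapsed_millis = millis() - last_millis;
        if (elapsed_millis >= ms)
            break; // the requested delay has passed

        // first test: a single long delay has run for >= 1 second;
        // second test: many short delays have added up to >= 1 second
        if ((elapsed_millis >= spark_loop_elapsed_millis) ||
            (spark_loop_total_millis >= SPARK_LOOP_DELAY_MILLIS)) {
            spark_loop_elapsed_millis = elapsed_millis + SPARK_LOOP_DELAY_MILLIS;
            spark_loop_total_millis = 0;
            SPARK_WLAN_Loop(); // service the cloud connection
        }
    }
}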


Ah, sadly, I have one of these indoor temp/humidity sensors (I’m just playing around with it), and the data read time (lots and lots of microsecond-level delays) is on the order of 2 seconds (to be safe, I’d probably want to assume that a call to loop() could take as long as 4-5 seconds). The cheesy approach to reading data would use busy loops, but it sounds like calling delayMicroseconds() would not help here.

Maybe I’ll connect this up to a raspberry pi, instead.

Why don’t you try this?


Thanks, I missed that.

What are the current best practices with regard to delays in the loop() function?

Currently I have a loop with a delay(1) inside which counts up until it reaches 10 seconds.
After 10 seconds, a process runs which includes a delay(1000).

I’m considering removing the delay(1) in an effort to get faster samples from a microphone.
Thoughts?
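
For what it’s worth, a common alternative is to drop delay() entirely and schedule the 10-second task with millis(), so the microphone is sampled on every pass through loop() and the cloud is serviced at the end of each pass. A sketch (doTenSecondProcess() and the A0 microphone pin are hypothetical stand-ins):

unsigned long lastRun = 0;

void loop() {
    int sample = analogRead(A0); // sample the microphone as fast as loop() spins

    if (millis() - lastRun >= 10000UL) { // every 10 seconds, without blocking
        lastRun = millis();
        doTenSecondProcess(); // hypothetical stand-in for the delay(1000) process
    }
}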