Balloons, they seem like a simple product. I can’t imagine anything being difficult about sealing a couple of layers of plastic film together. Well, that’s what I used to think before I was asked to build a machine to make foil balloons. Over the last 17 years, I have been involved with the design, manufacture, and installation of these machines. They weigh in at about 10,000 pounds and are 30 feet long. A large percentage has gone to customers outside the United States. Each installation comes with its own challenges. When they are international, those challenges can be significant.
A few years ago, on an installation in Central Mexico, I ran into a problem I’d never encountered before, and at the time I struggled with what could be causing it. This was the first sale to this customer and my first visit to their facility. Things weren’t going as planned: the machine wasn’t running. The customer was asking why, and I couldn’t explain why or say what I was going to do about it.
These machines have a servomotor to index product underneath a number of seal heads. Each index has a controlled stop, using optical registration of the film to position the pre-printed images under the seal heads. I specified the components, wrote the software, and had done this many times before. I thought I knew the equipment inside and out, and yet, randomly, the machine had a hiccup with the motion. Sometimes the index speed would be off, going too fast or too slow. Other times the index length was wrong. Every hiccup seemed to be different, without pattern. Every time this happened, the very next index seemed to be fine. It might happen every 5 minutes or only once an hour.
I spent a day monitoring the incoming power, looking for power anomalies to explain the malfunction. I went through every electrical connection of the machine, thinking somewhere I would find a loose wire that could account for the random nature of this problem. I contacted the drive manufacturer, hoping they would tell me about some firmware revision I wasn’t aware of and point me in a direction. I found nothing, and at this point I had a machine that didn’t work, limited resources, and the customer breathing down my neck.
It was getting toward the end of the third day, and there was still no real clue as to what could be wrong. I had the display (operator interface) disconnected from the servo so I could use the serial port for my laptop, and I had been monitoring the drive most of the day. It was then that it dawned on me that the machine had run without incident all day with the display unplugged. I plugged the display back in, and sure enough, after about 5 or 10 minutes of running, there it was. I left the display unplugged for the next shift as a test, and the machine performed perfectly, without exception.
OK, so now I had a clue where to look. Like most people, I instantly thought: I have a noise problem. This particular display has two serial ports. The second port was talking to a PLC, and that didn’t seem to have any problems, but that is far from a definitive conclusion. I didn’t have equipment with me to monitor the data stream, so I really couldn’t prove or disprove that noise was present or a problem. My experience with industrially hardened components is that if you follow some basic “best practices,” noise is seldom the issue. I ultimately decided I would have to dig deeper, and got the customer to agree to run the machine as is (display disconnected) until I could find a solution to the problem.
After returning home, I made a number of calls to both the display and servo drive manufacturers. In conversation with one of the drive firmware engineers, I was asked the interval between data transfers from the display to the drive. I had to check; according to the data sheet, it was every 500 milliseconds. After discussing the amount of data I was transmitting, he told me that was probably too often. We concluded I was overflowing the serial buffer. I had used this display / drive combination before, but previously it was just for data display. In hindsight, earlier installations probably had the same problem, but it was easy to overlook when the machine performance wasn’t affected.
After thinking about it further, it all made sense. The machine checks every cycle for motion parameter changes and recalculates the motion profile based on those values. When an overflow occurred, data was lost and the register being written to received only part of the data. The program would then use that corrupted data to calculate the next move. There’s little wonder the problem manifested itself in such a random manner.
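To make the failure mode concrete, here is a minimal sketch of how a half-written multi-byte register produces garbage, and how a simple range check could catch it before the value reaches the motion profile. The register layout, limits, and names here are invented for illustration; they are not from the actual drive.

```c
#include <stdint.h>

/* Hypothetical sketch: a 32-bit motion parameter arrives one byte at a
 * time over a serial link. If the buffer overflows mid-transfer, the
 * drive may latch a register holding a mix of old and new bytes. A
 * plausibility check on the assembled value is one defensive layer.
 * SPEED_MIN/SPEED_MAX are invented limits, units arbitrary. */

#define SPEED_MIN 1
#define SPEED_MAX 50000

typedef struct {
    uint8_t raw[4];            /* bytes as received from the serial port */
} speed_reg_t;

static uint32_t reg_value(const speed_reg_t *r)
{
    /* little-endian assembly of the four received bytes */
    return (uint32_t)r->raw[0]
         | (uint32_t)r->raw[1] << 8
         | (uint32_t)r->raw[2] << 16
         | (uint32_t)r->raw[3] << 24;
}

/* Returns the validated speed, or falls back to last_good when the
 * assembled value is implausible (e.g. a half-written register). */
uint32_t latch_speed(const speed_reg_t *r, uint32_t last_good)
{
    uint32_t v = reg_value(r);
    if (v < SPEED_MIN || v > SPEED_MAX)
        return last_good;      /* reject corrupted value, keep old profile */
    return v;
}
```

A check like this would not have fixed the overflow, but it would have kept a corrupted parameter from driving the next index, turning a random motion fault into a loggable rejection.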
The fix ended up being simple: reduce the polling frequency of the display. I have used 10 or 15 different brands of servos over the years. I have used even more brands of PLCs and operator interface units. For the most part, I have had few problems integrating various brands of components, but when you run into compatibility problems, you will seldom find the answer in the supplied manuals; you’ll most likely have to dig deeper.
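A rough way to sanity-check a polling rate is to compare how fast bursts arrive against how fast the receiver drains its buffer. The numbers below are invented; the article does not state the drive's actual service rate, only that 500 ms polls were too frequent for the data volume.

```c
/* Back-of-the-envelope sketch, with assumed numbers: if the display pushes
 * a burst of burst_bytes every poll, and the drive only consumes its
 * receive buffer at drain_bytes_per_s, then any poll interval shorter than
 * this lets the backlog grow until bytes are dropped mid-register. */

double min_poll_interval_ms(int burst_bytes, double drain_bytes_per_s)
{
    /* interval must be long enough for the drive to consume one burst */
    return 1000.0 * burst_bytes / drain_bytes_per_s;
}
```

For example, a 100-byte burst into a receiver that services only 200 bytes per second needs at least 500 ms between polls; any faster and the buffer eventually overflows, which matches the intermittent, load-dependent character of the fault described above.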
Mike Frazier is vice president of Axis Automation in Hartland, WI. For most of the last 30 years, Mike has been designing and building automation equipment for numerous industries. In the beginning of his automation career, he worked as a machinist, serving an apprenticeship at Centerline Industries in Waterloo, WI. In those early years, Mike took an interest in the control side of his projects. Since that time, he has developed control systems based on PLCs, motion controllers, PCs, and embedded controllers, as well as combinations of those platforms.
Yes, I totally agree with all your comments. On reflection, I would like to clarify a few things in the hope that your comments to software engineers are heard. I did slow down the polling rate, and the fix has stood the test of time. With that said, I agree with you and have since changed how I choose some components.
Let me explain:
With the balloon machine, the device sending the motion parameters was an operator interface with very limited intelligence. The OIU sends data to the serial ports at a set rate, regardless of changes. The motion controller was fully programmable, but the comms are all handled in the background, with only the variables exposed in code.
Because of these parameters (limitations), when I am using an OIU (dumb interface), I will only use one from the drive manufacturer and not a third-party unit. This is simply to try to prevent any unexpected communication anomalies. I am not suggesting for a moment that there is anything wrong with OIUs from other manufacturers. Given the right application and enough time to develop it, I wouldn’t hesitate to use them, but in my world I typically don’t have the luxury of time.
It is a false economy to buy a cheapo controller if this is the sort of thing it will do to you. (At least you found your fault quickly; imagine if your comms buffer overflowed once a week and your customer was on the other side of the world. You would inevitably have made a loss on the project with all the remedial work, and your reputation would be in tatters.) I won't risk that for the sake of a few tens or hundreds of dollars on a cheap knock-off controller.
I accept that your choice of equipment is sometimes limited and that even large manufacturers might have bugs in their kit. It is my hope that software engineers reading this will realise that such a simple bug in their code, for circumstances which "will never happen" (ahem), could, in the real world, cause a major problem; on a bench in a lab, a comms failure is annoying. In a steel mill or a printing press with hundreds of tons of moving metal, things don't go "click," they go "bang," and people can get hurt.
Regarding your comment about motion controllers' sole job being to control motion, I almost agree. However, we all know that you have to know what position to be at, which usually / sometimes involves comms; so if your comms fail, your motion controller fails. I have designed products in which I even allow for limited error concealment in comms messages, so if a single message is lost due to noise (for example), the machine can be configured to make a "best guess" and raise an alert so that the user program can take whatever action is required on that system. Of course, it won't do so indefinitely, and there comes a point when it has to raise an error.
This is available behaviour, not default behaviour, and the decisions are left to the implementer; crucially, though, the information is given to the user by means of warnings / trips and NOT just thrown away and ignored. And buffer overflow is strongly defended against by both software and hardware where available.
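The "limited error concealment" behaviour described above can be sketched roughly as follows: hold the last good value for a bounded number of lost messages, warn each time, and trip once the limit is exceeded. The names, the threshold, and the enum are all illustrative, not from any vendor's API.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of bounded error concealment on a position link: conceal a few
 * consecutive lost messages with the last good value (while warning the
 * user program), then raise a hard trip. MAX_CONCEALED is an assumed,
 * configurable limit. */

#define MAX_CONCEALED 3

typedef enum { COMMS_OK, COMMS_WARN, COMMS_TRIP } comms_status_t;

typedef struct {
    int32_t last_position;     /* last value received intact */
    int     missed_in_a_row;   /* consecutive lost/garbled messages */
} position_link_t;

comms_status_t link_update(position_link_t *l, bool msg_ok, int32_t pos,
                           int32_t *out)
{
    if (msg_ok) {
        l->last_position   = pos;
        l->missed_in_a_row = 0;
        *out = pos;
        return COMMS_OK;
    }
    l->missed_in_a_row++;
    *out = l->last_position;           /* best guess: hold last value */
    if (l->missed_in_a_row > MAX_CONCEALED)
        return COMMS_TRIP;             /* stop motion, raise error */
    return COMMS_WARN;                 /* conceal, but tell the user */
}
```

The key point is that the WARN state is reported, not swallowed; the user program decides what to do with it, and the trip is unavoidable once losses persist.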
Incidentally, if the comms buffer did overflow, who is to say that the next RAM locations won't contain the speed demand or some other state machine state? Or a user variable which controls some digital outputs which extend or retract a sensor or arm into the works of the machine? A data error could destroy the machine, and therefore the project timescales and potentially the whole project.
Sounds like a catastrophe waiting to happen for the sake of a bounds check on the receive buffer.
My comment was not apportioning blame; I was pointing out that if errors are handled PROPERLY when they are first known about, then finding the real cause of a problem is so much easier. If your drive (or PLC or whatever it was) had raised an alarm state, then you would have "solved" the problem much more easily and quickly. (Quotation marks used because: is the problem really solved, or just worked around? Could it ever recur if somebody plugs another comms cable into the system?)
This needed pointing out because a comms device which can overrun its own buffer is hideously bad. That's a COMMS101 type of error (if it's quite as described / interpreted) and would make me strongly reconsider whether the vendor really knows what they are doing.
I did the programming for another company on a project where I had to pass encoder positions over serial Modbus at a fairly fast rate. Everything worked as I expected until the speed increased. At somewhere around 4 inches a second, things started screwing up. I would have expected an ever-increasing latency of position values, but instead all the data just disappeared. The motion took precedence over the communications, and this being a less than full-featured controller, there was no place to store the data. It was just lost.
At times I get to pick components, and other times the customer specifies them. I don’t know if this speaks to a more rapidly changing market or the number of years I have been in the business, but more and more I find myself changing controllers due to the end of life of a product. All these factors make it very hard to avoid all the pitfalls and many times force a workaround.
I would agree your point is valid and in theory absolutely right. Reality, as usual, tends to be more shades of gray than the black-and-white answers we all hope for and search for. You could blame me for poor component selection, the hardware manufacturer for poor firmware routines, or maybe even certain organizations for not defining or adopting more universal standards.
In the automation world, we are saddled with the task of specifying components and designing parts, and then given 12 – 20 weeks to make it all happen. Most of the machines we build are a “one-off.” The balloon machines are somewhat different, as we’ve built quite a few of them, but even those, given the span of time they were built over, have had 3 major controls changes / upgrades. None of this is an attempt to excuse poorly picked components or sloppy software, but it does end up being a reality that we face on a daily basis. A lot of times we only get a week or two of equipment up time before shipping.
One thing I will say with regard to motion controllers and anomalies such as this one: a motion controller has one main job, and that is to control motion; everything else comes second. I hesitated saying those words because I am sure I will never hear the end of comments fueled by them. I have experienced anomalies more than a couple of times over the last 30 years. I routinely contact the manufacturers in those cases but seldom get satisfactory answers with regard to time slicing and priorities other than the management of the motion itself. Everyone can tell you about their “servo loop”!!
Sounds to me like the receiver was faulty - should it have been able to overflow its own buffer? Surely it should have rejected the message and not corrupted data beyond the end of its own receive array.
What you were seeing was arguably (if I have understood the problem correctly) a secondary fault. The receive handling was actually at fault. If that had returned a NAK or something it's likely the display would have detected it and stood a fighting chance of reporting the error to the next layer up. (Perhaps the user.)
This seems to me like a classic wild goose chase with a workaround; if the various components of the system had been designed with a more defensive mindset, I bet this would not have taken nearly as much time to detect.
A practical approach to analyzing and solving erratic running of the servo. In this situation, proper analysis allowed the problem to be solved by rewriting the program. Many times, noise, interference, or poor supply filtering or regulation causes this kind of erratic operation. Often it is difficult to solve these issues without degrading the performance of the system.