And we’re offline again (mostly)

The Asterisk is currently offline (mostly)

Some classes of call can be completed in some circumstances (e.g. calling the speaking clock on 400 works).
Some calls complete but with no audio (I believe calls between our home phones do this).
All calls out to the strowger network fail.

Calls in from the sipgate dial-in number get as far as the voice prompt on the Asterisk box itself, but can’t dial out into the Strowger network and seem to have the same audio problems (i.e. if I call my VoIP number I get no audio).

My remote access has stopped working, so I can’t troubleshoot it in detail from here. Detailed troubleshooting will have to wait until I’m able to get on-site. Unfortunately that’s not until Saturday 9th April.

Updates will follow on here as and when I know anything.

Asterisk to UAX Junctions – Update

Over the Christmas period, I spent quite a lot of time giving our Asterisk config a good going over, trying to identify the root cause of it not always releasing junctions at the end of a SIP->UAX call.

In the process, I found and fixed several issues.

  • We now pause slightly before dialling, to play a little nicer with slow line finders.
  • Whenever we use Dial to send a call out to the UAX, we explicitly jump to the (h)angup extension, so that we can trap the end of the call.
  • We’re now a bit more thorough about logging what we’re doing as a call progresses. This isn’t a “fix” as such, but it does help with debugging intermittent errors.
  • Our DAHDI config was using Loop Start (ls) rather than the preferred KewlStart (ks) signalling. ks is supposedly a bit better at disconnect supervision.
  • We upgraded from Debian Wheezy to Debian Jessie. This gives us a more recent version of Asterisk.
  • In sip.conf we now set an RTP timeout so that we end the call if we don’t receive any RTP (audio) traffic for 60 seconds. This should help to end calls which haven’t been torn down properly.
  • We turned off “SIP ALG” on the BT Business Hub router, as this seemed to be causing SIP packets to occasionally go astray.
  • We were rotating the logs a little aggressively and were only keeping a couple of weeks. This has been expanded so we now keep several months – which should make retrospectively investigating issues easier.
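
For anyone wondering what the RTP timeout amounts to, here is a rough Python sketch of the idea. It is only an illustration of the behaviour described above, not how Asterisk implements it: the class name, method names and the injectable clock are all made up for the example.

```python
import time


class RtpWatchdog:
    """Tracks when audio was last received on a call (illustrative only)."""

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        # 60 seconds matches the timeout mentioned in the list above.
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        # Treat the start of the call as the first "packet".
        self._last_packet = clock()

    def packet_received(self):
        """Call this whenever an RTP packet arrives for the call."""
        self._last_packet = self._clock()

    def should_hang_up(self):
        """True once no RTP has been seen for longer than the timeout."""
        return self._clock() - self._last_packet > self.timeout_seconds
```

The point is simply that a call which was never torn down properly stops producing audio, so silence on the RTP side for long enough is a reasonable cue to end it.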

With all that done, we decided last week to reinstate one of the junctions from Asterisk to UAX. By only enabling a single junction, we can easily detect when it gets stuck (as any subsequent calls will get NU from the Asterisk).

It’s been back up for a week, and so far it’s working well. I’m still keeping an eye on it, and will continue to do so for a while before we re-enable the second junction.

Calls from Asterisk->UAX offline

To quote today’s diary entry from http://dfrtelecoms.org.uk/diary15.htm:

December 19: Peter came in today to find an alarm from Lydney Signal Box. Rick and I met him at the box. It was a low volt alarm caused by a junction held permanently. This had the ringer running continuously. The volts were down to 29 when we arrived. Disconnecting the junction cleared the fault and stopped the ringer. The battery voltage started to rise slowly. Hopefully it will all be OK. Back at Norchard we found the Asterisk holding the connection. We don’t know why, the circuits from the Asterisk have been pegged out and the problem referred to Paul.

As a result, calls from the Asterisk to the internal phones at the railway are offline. (This includes calls connected from the sipgate number)

This problem seems to have started since we moved the FXO ports on the Asterisk from being directly connected to an incoming selector on the UAX, to being connected to a pair of line circuits.

I can’t get to site to do any direct investigation (and I can’t get to my copy of Atkinson to do any diagram based theorising because it’s all packed up for some decorating work) but I suspect something is behaving differently from an electrical point of view.

I have a hunch it might be related to line reversals on called-sub-answer, but I need to check that before I can make any changes or recommend that we reconnect the circuits.

Watch this space…

Back online…

Well that was an effort.

We’re back online. It seems my port forwarding documentation is accurate, but our dynamic DNS provider is being “less than perfect” at the moment.

To cut a long story short, they were claiming we hadn’t changed our IP when we had (even when I tried to force an update) so weren’t propagating the changes.

All sorted for now, but I’m going to have to find another way around this DDNS issue.

New router

We’ve got a new BT Broadband router, and the new one hasn’t been configured to work with the asterisk yet.

So at the moment, the phones we’ve got at home are all offline, and my remote access is also offline so I can’t fix it remotely.

The monitoring doesn’t make this state of play obvious, but I’ll sort that out once I’ve got access.

It’ll all come back to life when I next make it to site (hopefully Saturday 21st, weather permitting)

New monitoring

Since we went static IP for our broadband, I can no longer infer the state of the Norchard broadband from the number of times we change IP address per hour (previously every time the broadband connection went down it would come up on a different IP address)
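
The old trick can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the script I actually ran: the function name and the shape of the input (a list of timestamped IP observations) are invented for the example, and the addresses are documentation-only ones.

```python
from collections import Counter
from datetime import datetime


def ip_changes_per_hour(observations):
    """Count IP address changes per clock hour.

    observations: iterable of (datetime, ip_string) pairs, oldest first.
    Returns a Counter keyed by the hour (datetime truncated to the hour)
    in which each change was first seen.
    """
    changes = Counter()
    previous_ip = None
    for seen_at, ip in observations:
        if previous_ip is not None and ip != previous_ip:
            hour = seen_at.replace(minute=0, second=0, microsecond=0)
            changes[hour] += 1
        previous_ip = ip
    return changes


# Hypothetical observations: two address changes between 10:00 and 11:00.
sample = [
    (datetime(2016, 1, 1, 10, 5), "203.0.113.1"),
    (datetime(2016, 1, 1, 10, 20), "203.0.113.2"),
    (datetime(2016, 1, 1, 10, 40), "203.0.113.3"),
    (datetime(2016, 1, 1, 11, 10), "203.0.113.3"),
]
print(ip_changes_per_hour(sample))
```

With a static IP the count is always zero, which is exactly why it stopped being useful as a health indicator.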

So I’ve put some better monitoring in place, and http://dfrvoip.org.uk/blog/status/ now tracks our connection to the outside world by pinging the Google DNS servers.

Due to the way they’re hosted, the graphs on that page might not be visible on every internet connection. I’ve got a few ideas about how to change that, but they’ll have to wait for another day as life is somewhat hectic at the moment!

The technology I’m using is a pretty basic piece of network monitoring software called smokeping. It’s not my monitoring tool of choice, but it’s easy to install and get going, and for something as simple as monitoring a single internet link it’s pretty good.
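
Roughly speaking, this kind of monitoring boils a batch of pings down to two numbers: how many were lost, and a typical round-trip time. The Python below is a minimal sketch of that summary step, assuming the ping results have already been collected; the function name and input format are my own invention and are not part of smokeping.

```python
from statistics import median


def summarise_pings(rtts_ms):
    """Summarise one round of pings.

    rtts_ms: list of round-trip times in milliseconds, with None for a lost ping.
    Returns (loss_percent, median_rtt_ms); the median is None if every ping was lost.
    """
    if not rtts_ms:
        raise ValueError("need at least one ping result")
    replies = [r for r in rtts_ms if r is not None]
    loss_percent = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    median_rtt = median(replies) if replies else None
    return loss_percent, median_rtt


# Hypothetical round of four pings, one of which was lost.
print(summarise_pings([20.0, None, 22.0, 21.0]))  # (25.0, 21.0)
```

A link that is up but struggling shows as rising loss and a wandering median, which is far more informative than counting IP address changes.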

The gratifying result of this is that our current broadband looks a lot more stable than the previous broadband!

Changes to dial in access

Some time ago, I put in place a test number on our Asterisk system to allow the Telecoms Team to access the internal telephone network from the public telephone network, so that we can test and verify facilities, and access our exchange test numbers from home.

It’s been brought to my attention that this number has “leaked” and is now being used by at least one person to access the internal network from their mobile phone during the running day.

Apparently me working on the Asterisk DP during the day on the 8th August caused an issue for this person, and while they didn’t complain to me directly, I did get to hear of the complaint.

I don’t know who this person is, but I can see that they have used the facility 27 times in the last 3 weeks and rumour has it they are operational staff – and that worries me.

The Asterisk system which hosts the test number is maintained by a single volunteer who has a full time day job in Bristol (me!), and I can only really work on it at weekends – during the running day. It has no resilience or redundancy designed into it, and the phone number is provided by a free provider (so may go away or change at short notice)

In the past, when the Asterisk has failed it’s taken me 3 weeks to get to site to resolve the issue (in one case, much longer). If that were to happen again, and someone was relying on it for operational use, then that may put the safety of the railway, or the continuity of the business, at risk!

Given these limitations I cannot in good conscience allow the facility to become relied upon for business, operational, safety or emergency use – so I need to nip this in the bud and clarify the position.

I have changed the recorded message on it this evening to something which makes the status of the facility clear, and if the message doesn’t appear to get through, I may be forced to change the PIN.

Sorry if that’s inconvenient for anyone, but I just can’t let an informal test facility sneak into operational service like this!

Two minor fixes

I’ve done two minor fixes this week:

  • Status Graphs: These were being updated, but not drawn for most visitors. I’ve fixed that now; if you still have no graph, try a forced refresh of the page (Ctrl-F5).
  • Speaking Clock: Ian noticed that it was saying “thirty four” twice, skipping “thirty five” and jumping straight to “thirty six”. This was a fault I fixed on my Asterisk server years ago, but seemingly never applied the fix to the railway version. The root cause was that the “35.wav” file contained the words “thirty four”. I’ve updated to a newer version of the samples and everything is working again.

I’m going to re-think the status graphs, as we now have a static IP so don’t need dynamic DNS any more (and so don’t really need to monitor it any more).

Perhaps I’ll finally write the peer monitoring stuff as well, so we can spot when sipgate goes away.

We’re definitely back now. Hopefully.

After much head scratching over inconsistent one-way audio problems, with some routes through the Asterisk producing audio and some not, I noticed I’d set the port forwarding on the router for the RTP stream as “TCP” not “UDP”. A simple slip of the finger: I ticked the wrong box on the router interface.

It caused some really weird problems though!

Any audio path which resulted in the Asterisk setting up the RTP stream worked, but anything which relied on the VoIP phone initiating the RTP stream ended up with either no audio or one-way audio. This wasn’t immediately obvious from the pattern of symptoms as reported!
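
To show why the symptoms came out that way, here is a toy Python model of a home router. It is a deliberately simplified sketch and not how any real router works internally: the class is made up, port 10000 and the phone’s address are placeholder values, and it assumes the router keeps the same port number for flows opened from the inside.

```python
class SimpleNat:
    """Toy model of a home router: port forwards plus outbound-initiated flows."""

    def __init__(self):
        self.forwards = set()  # (protocol, port) pairs forwarded to the inside host
        self.flows = set()     # (protocol, inside_port, remote) flows opened from inside

    def add_forward(self, protocol, port):
        """Configure a port forwarding rule for one protocol and port."""
        self.forwards.add((protocol, port))

    def outbound(self, protocol, inside_port, remote):
        """The inside host sends first, so the router remembers the flow."""
        self.flows.add((protocol, inside_port, remote))

    def inbound_allowed(self, protocol, port, remote):
        """Would a packet arriving from `remote` reach the inside host?"""
        return (protocol, port) in self.forwards or (protocol, port, remote) in self.flows


nat = SimpleNat()
nat.add_forward("tcp", 10000)  # the mistake: RTP is UDP, so this rule never matches it
phone = "198.51.100.7"         # placeholder address for a remote VoIP phone

# The phone sends RTP first: no UDP forward and no existing flow, so it is dropped.
print(nat.inbound_allowed("udp", 10000, phone))  # False

# The Asterisk sends RTP first: the router now lets the replies back in.
nat.outbound("udp", 10000, phone)
print(nat.inbound_allowed("udp", 10000, phone))  # True
```

So the wrong tick box only bit when the far end had to speak first, which is why some routes had audio and others didn’t.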

Anyway, I have rectified the error, ticked the right box, and my testing seems to suggest that it’s working now.

I’ll try and fix the next fault a bit quicker, promise!

And we’re back. Hopefully.

The router was reset on March 15th to try and troubleshoot the persistent problems we’re having with it. Unfortunately, in the process all the port forwarding rules we require dropped off the router.

Due to holidays, Easter and various other commitments I wasn’t able to attend until April 7th. While I thought I’d fixed it then, I discovered when I got home that I hadn’t, and that one of the changes I thought I’d made hadn’t been saved.

So it’s now April 18th (over a month since James reset the router under advice from BT) and we should now be back online.

I think this is a perfect example of why the DFR VoIP system should never be considered a “production” or “business critical” service – anything which goes wrong and relies on a chap who lives a 40 minute drive away and has a full time job (so isn’t available in business hours) just isn’t going to have a quick turnaround for fixes!