Holiday Fun: Debugging a Home Network with Loggly
After a recent rainstorm with accompanying 40+mph wind gusts, I was having intermittent Internet connectivity problems. There were days went connectivity would drop for 20-30 seconds dozens of times. Then there would be stretches of perfectly solid connectivity. After turning off some extra devices and access points to simplify the LAN, there was some improvement, but it wasn’t perfect. It was time to start getting serious about debugging the issue, and the Christmas holiday provided the time necessary to put things in place. I had hoped for more time away from my day job running technical operations at Loggly, but luckily I had the skills and the log management service to get the debugging done quickly.
Looking at the logs on the home router, I quickly discovered that it had been rebooting itself. How often was impossible to tell, since the old logs were wiped at each reboot. I needed to get the logs from this router and a few other devices in one place so I could nail down what was going wrong. This necessitated configuring the only spare machine I had lying around; an old Aluminum MacBook Pro with OS X 10.4 installed.
The setup was fairly trivial, following the basic ideas from a MacWorld OS X Hints article. Once configured, the router setup was updated to pass syslog events to the MacBook. But I didn’t want to just sit there with the logs streaming waiting for something to break. Being able create a simple visual to look at the state of my network would also make quick work of narrowing down which logs to examine. Thankfully, I could pass my events to Loggly and they would provide the tools to visualize and correlate issues.
The version of syslog available on the old MacBook did not allow for altering the log headers as necessary to pass on them onto Loggly. I was going to need to have a script to watch the syslog file on the MacBook and pass the events to Loggly. With that in mind, I added some functionality to reformat the events into JSON to simplify parsing. The resulting script below tails a syslog file, reformats the log lines at JSON, then sends the events to Loggly over HTTP/S.
#!/bin/bash | |
## Read from a logfile/tail and send out json to Loggly. | |
## Does simple parsing on firewall events, but nothing else. | |
tail -Fn0 /var/log/all.log | while read line; do | |
## Did this come from my router | |
if [[ "${line}" == *"192.168.1.254"* ]]; then | |
## I ignore the original timestamp from the log file and | |
## replace it with current time. | |
## That will be within ms of reality. | |
## datetime=`echo $line | awk '{ print $1,$2,$3 }'` | |
datetime=`date -u +"%Y-%m-%dT%H:%M:%SZ"` | |
## What subsystem of the router is logging | |
source=`echo $line | awk '{ print $6}' | cut -f1 -d:` | |
## The log message. | |
value=`echo "$line" | cut -d: -f4-` | |
## echo "$source - $value" | |
## If this is something from the firewall, break it | |
## down further. | |
if [[ "${source}" == *"fw"* ]]; then | |
src=`echo $value | awk '{ print $1 }' | cut -f2 -d=` | |
dst=`echo $value | awk '{ print $2 }' | cut -f2 -d=` | |
ipprot=`echo $value | awk '{ print $3 }' | cut -f2 -d=` | |
sport=`echo $value | awk '{ print $4 }' | cut -f2 -d=` | |
dport=`echo $value | awk '{ print $5 }' | cut -f2 -d=` | |
message=`echo $value | awk '{ print $5 }' | \ | |
cut -f2 -d= | cut -f2-` | |
curl -s -k -X POST https://logs-01.loggly.com/inputs/[AUTHTOKEN]/tag/uverse -d "{ \"timestamp\": \"${datetime}\", \"uverse\": { \"source\": \"${source}\", \"value\": { \"src\": \"$src\", \"dst\": \"$dst\", \"ipprot\": \"$ipprot\", \"sport\": \"$sport\", \"dport\": \"$dport\", \"msg\": \"$message\" }, \"raw\": \"$line\" } }" > /dev/null | |
## Everything else... so far | |
else | |
curl -s -k -X POST https://logs-01.loggly.com/inputs/[AUTHTOKEN]/tag/uverse -d "{ \"timestamp\": \"${datetime}\", \"uverse\": { \"source\": \"${source}\", \"value\": \"${value}\", \"raw\": \"${line}\" } }" > /dev/null | |
fi | |
fi | |
done |
#!/bin/bash | |
## check-link2.sh | |
## Andy Mayhew <andy@loggly.com> | |
## Check the status of internal network devices and the uplink. | |
## Simple ping of each device. Then a simple ping of each side of the | |
## router, first hop onto the Internet, | |
## and finally a "distant" device. | |
## Log the ping response time or failure, build up a json event and | |
## send to loggly. | |
## | |
## usage: nohup ./check-link2.sh & | |
## List of devices | |
devName=("inside" "outside" "firstHop" "web.mit.edu" "AP Closet") | |
devIP=("192.168.1.254" "255.1.1.1" "255.1.2.3" "23.207.46.93" "192.168.1.250") | |
## Declare some global variables and set defaults. | |
tag="network2" ## tag our events in Loggly | |
token="[AUTHTOKEN" ## Loggly Auth Token | |
isUp="true" | |
failedEvents=() | |
## Sends an event to Loggly. Requires the log event as an parameter. | |
sendEvent() { | |
event=$1 | |
curl -s -k -X POST https://logs-01.loggly.com/inputs/${token}/tag/${tag} -d "${event}" | |
} | |
## Attempts to ping a device and record the result. | |
processPing() { | |
dest=$1; destName=$2 | |
results=`ping -c1 -t1 $dest | tail -2` | |
rtt=0 | |
packets=0 | |
up="false" | |
if [[ "${results}" == *"round-trip"* ]]; then | |
rtt=`echo ${results} | awk '{ print $(NF-1)}' | \ | |
cut -f1 -d/` | |
packets=1 | |
up="true" | |
fi | |
echo "\"${destName}\": { \"name\": \"$destName\", \"dest\": \"$dest\", \"rtt\": $rtt, \"packets\": $packets, \"upStatus\": $up }" | |
} | |
## We just loop forever. | |
## Could've fork, forked to daemonize, but chose to just | |
## use nohup when I call the script. | |
while true; do | |
rtt="" | |
## Cycle through the list of devices. | |
for (( i = 0; i < ${#devIP[@]} ; i++ )) do | |
## echo "$i - ${devIP[$i]} - ${devName[$i]}" | |
if [ "${#rtt}" -eq 0 ]; then | |
rtt=$(processPing "${devIP[$i]}" "${devName[$i]}") | |
else | |
rtt="${rtt}, $(processPing "${devIP[$i]}" "${devName[$i]}")" | |
fi | |
done | |
## Time stamp for this event. | |
timedate=`date -u +"%Y-%m-%dT%H:%M:%SZ"` | |
## Check the results from the ping tests. If any fail, set the global isUp flag to down (false) | |
if [[ "${rtt}" == *"false"* ]]; then | |
isUp="false" | |
else | |
isUp="true" | |
fi | |
## Build the event message, with ISO timestamp, global isUp flag, | |
## and individual responses. | |
message="{ \"timestamp\": \"${timedate}\", \"isUp\": ${isUp}, $rtt }" | |
echo "Sending: $message" | |
rez=$(sendEvent "${message}") | |
## check whether curl command worked. If not, push onto stack. | |
if [[ "${rez}" != '{"response" : "ok"}' ]]; then | |
if [ ${#failedEvents[@]} -eq 0 ]; then | |
failedEvents[0]="${message}" | |
else | |
failedEvents=("${failedEvents[@]}" "${message}") | |
fi | |
fi | |
sleep 5 | |
## Attempt to resend failed events | |
if [ ${#failedEvents[@]} -gt 0 ]; then | |
for i in "${!failedEvents[@]}"; do | |
echo "Resending: ${failedEvents[$i]}" | |
rez=$(sendEvent "${failedEvents[$i]}") | |
if [[ "${rez}" == *"ok"* ]]; then | |
unset failedEvents[$i] | |
fi | |
done | |
fi | |
done |
With the data now flowing into my Loggly account, everything I needed to debug my issues were all in one place. I created dashboard widgets to be able to see when any element of my network was down, along with the ping response times from the network checks. The visual elements allowed me to quickly see when something broken or if I was just experiencing a network slowdown.

From the visual view, I could quickly click on an event and see the precise logs from that time period. Then it was a trivial matter of looking at a few log lines from my various sources to find the culprit. In the case above, I could see in the logs that one of my internal WiFi access points was flapping. It kept losing radio sync with the main AP and would drop off the wireless network.

In the end, I found multiple points of failure in my network. The router from AT&T was flaking out, the U-Verse link was losing sync, and an access point on my internal network was flapping. A couple of days with some AT&T techs on-site to replace the router and the fibre from the pole to the house remedied the first two problems. And the problem access point has been turned off until I can make an Ethernet run from it to the router instead of letting the device expand the WiFi network. If I hadn’t been able to put all my logs in one place and easily created visualizations of the data, I would most likely still be trying to figure out what was going wrong.
Postscript: Telling AT&T customer support that, “the logs from the router indicate it is rebooting itself,” instantly gets your problem escalated within the support chain and a Tech at your house a few hours later on a Sunday. 🙂
The Loggly and SolarWinds trademarks, service marks, and logos are the exclusive property of SolarWinds Worldwide, LLC or its affiliates. All other trademarks are the property of their respective owners.
Hoover J. Beaver