Computing

Finding Order in Geos: A Furious Attempt to Make Melbourne's Open Public Transport Data Useful

Motivation

Public Transport Victoria (PTV) released their Application Programming Interface the 6th of March last year, allowing the public to make some queries on timetable data. This is a great first step in the right direction in terms of an Open Data policy, but unfortunately, the results are a bit thin on usefulness. For example, you can't find out what trains are arriving at a stop (only those which are departing); if you're trying to make a journey planner in some kind of polynomial time (ie – in time to catch a train), you're basically out of luck. Furthermore, and perhaps even more problematically, when asking about what stops on a line exist, it'll give you an alphabetical list; I have no idea of the use-case for this, other than annoying everybody.

Despite these shortcomings, I was determined to make use of this data, and am doing so at https://tt.joelbuckley.com.au/. On posting this site to Facebook for feedback, a colleague wrote me a great post outlining the issue with an alphabetical stop list – and I agree: it's basically useless, particularly for an application designed for quickly finding the next train!

So, I set about figuring out how to list a bunch of stops in order.

Passengers are Travelling Souls, Man

It soon occurred to me that this was basically a Travelling Salesman Problem (TSP), and I was pretty quickly disheartened. The TSP is a famous NP-hard problem, which means that the only way to get the most efficient route is to iterate through all of the possible permutations. If you have 3 stops on your line, that's only 6 permutations to try out, but if you have as many stops as the 901 bus (I enjoy using the pathological example as a foundation for testing an algorithm because it saves on re-writes when I discover something utterly naïve), that's far too many to deal with. I needed a plan.

Thinking about this further, I realised I might have already solved part of the problem, in that I already scan express trains routes, to find the stations missing in between listed stops, as per East Richmond, Burnley, Hawthorn, and Auburn below:

Screen Shot 2015-01-12 at 8.44.24 pm

 

 

Furthermore, this wasn't just another Travelling Salesman Problem – there was additional information, implicit in the data: this was a route a bus might take. This probably means that wild deviations are unlikely without intermediate points, and that in general, it will head in the one direction.

Diff'rent Folks

Happily, in amongst trying to think up different approaches to deal with this, my family got together for my mother's birthday, which allowed me to ask four pretty smart people – some of whom hate maths – how to to figure out the line order from a series of points. I gave them the following information:

  1. I have a series of tuples, which consist of an identifier (e.g. a stop name), a latitude, and longitude;
  2. These locations are stops along a transport route.

From there, we came up with the following ideas:

  1. Find the two points with the largest distance between them.
  2. Knowing that we could apply the TSP to a few points at a time: iterate over a fixed selection size (removing from list of locations), remove the end-points of the solution, and insert those into the pool for the next iteration TSP with a new group from the list of locations.
  3. Group the points together, possibly by suburb, get an idea of what suburbs lie where on the map, and then drill down into each stop locality for a more granular approach.

Each have pro et contra, some of which can be stated from the outset:

Approach 1

  • Fast – O(n2) – that's great.
  • Distracted by vertexes on a shape I described as a 'vertically squashed trapezoid'; points which are close to the true origin of the transport service, but which require the service to travel in the opposite direction to the rest of the stops for a time. A travelling salesman algorithm would (hopefully) deal easily with this trap, whereas a naïve distance algorithm, while probably correct for the majority of the time, would struggle on the more circuitous lines.

Approach 2

  • More algorithmic complexity.

Approach 3

  • Same complexity as approach 2.
  • Ability to tweak the algorithmic approach between suburbs and locations within a suburb.

Coding It – Outcome

Unfortunately, a constraint is that the processing time could only be about a second or so – users of web applications won't wait much longer. Therefore, the following implementation of an iterative travelling salesman approach was too costly:

function sort_line($data_out) {
	define("CYCLE_LIMIT", 2);
	$i = 0;
	$p1 = array();
	$p2 = array();
	while($i < count($data_out)) {
		$j = 0;

		$tsp = new TSP;
		if($i > 0) {
			$tsp->add($p1->location_name, $p1->lon, $p1->lat);
			$tsp->add($p2->location_name, $p2->lon, $p2->lat);
		}
		while($j < CYCLE_LIMIT && $i < count($data_out)) {
			$tsp->add($data_out[$i]->location_name, $data_out[$i]->lon, $data_out[$i]->lat);
			$j++;
			$i++;
		}
		echo 'locations' . "\n";
		print_r($tsp->locations);

		$tsp->compute();

		$p1 = get_stop_name($tsp->shortest_route()[0], $data_out);
		$pos = count($tsp->shortest_route()) - 1;
		$p2 = get_stop_name($tsp->shortest_route()[$pos], $data_out);

		unset($tsp);

	}
	return array($p1, $p2);
}

function get_stop($stop_id, $array) {
	foreach($array as $key => $data) {
		if($data->stop_id == $stop_id) {
			$out = $data;
		}
	}
	if(!isset($out)) {
		$out = array();
	}
	return $out;
}

function get_stop_name($stop_name, $array) {
	foreach($array as $key => $data) {
		if($data->location_name == $stop_name) {
			$out = $data;
		}
	}
	if(!isset($out)) {
		$out = array();
	}
	return $out;
}

A useful trick, it turned out, was to use the approach of 'divide and conquer', splitting the list of suburbs into smaller and smaller chunks, ordering them subsequently (using a polynomial-time algorithm). This is the result:

function order_suburbs($p1, $p2, $suburbs, $print = true) {
	if(count($suburbs) > 1) {
		list($p1n, $p2n) = suburb_endpoints($suburbs);
		if(haversineGreatCircleDistance($p1['lat'], $p1['lon'], $p1n['lat'], $p1n['lon']) > haversineGreatCircleDistance($p2['lat'], $p2['lon'], $p1n['lat'], $p2n['lon'])) {
			$pt = $p2n;
			$p2n = $p1n;
			$p1n = $pt;
		}
			$i = 0;
			$dist_diff = 0;
			$min = -1;
			foreach($suburbs as $key => $data) {
				$distance_p1 = haversineGreatCircleDistance($data['lat'], $data['lon'], $p1['lat'], $p1['lon']);
				$distance_p2 = haversineGreatCircleDistance($data['lat'], $data['lon'], $p2['lat'], $p2['lon']);
				$dist_diff = abs($distance_p1 - $distance_p2);
				if($min == -1 || $dist_diff < $min) {
					$index = $i;
					$min = $dist_diff;
				}
				$i++;
			}
			$midpoint = $suburbs[$index];

			$p1n_key = array_keys($suburbs, $p1n)[0];
			$p2n_key = array_keys($suburbs, $p2n)[0];
			unset($suburbs[$p1n_key]);
			unset($suburbs[$p2n_key]);

			unset($suburbs[$index]);

			$suburbs = array_values($suburbs);

			$p1suburbs = array();
			$p2suburbs = array();

			foreach($suburbs as $key => $data) {
				if(is_between($p1n, $midpoint, $data)) {
					$p1suburbs[] = $data;
				} else {
					$p2suburbs[] = $data;
				}
			}

			$array_out = array_merge(array($p1n), order_suburbs($p1n, $midpoint, $p1suburbs), array($midpoint), order_suburbs($midpoint, $p2n, $p2suburbs) , array($p2n));
			return array_filter(array_unique($array_out, SORT_REGULAR));
	} else {
		return $suburbs;
	}
}

function get_stop($name, $array) {
	$out = array();
	foreach($array as $key => $data) {
		if($data['suburb'] == $name) {
			$out = $data;
		}
	}
	return $out;
}

function find_suburbs($mode, $line, $data_out) {

	/* Get each suburb, once, into an array with lat and lon.
	We could do mid-point, but let's just get it working first,
	with a randomly chosen lat/lon */
	$suburbs = array();
	foreach($data_out as $key => $data) {
		$suburbs[$data->suburb] = array('lat' => $data->lat, 'lon' => $data->lon);
	}

	// Get the suburb name out of the key and into the data
	foreach($suburbs as $key => $data) {
		$suburbs2[] = array('suburb' => $key, 'lat' => $data['lat'], 'lon' => $data['lon']);
	}

	return $suburbs2;
}

function suburb_endpoints($suburbs) {
	define("CYCLE_LIMIT", 3);
	$i = 0;
	while($i < count($suburbs)) {
		$j = 0;

		$tsp = new TSP;
		if($i > 0) {
			$tsp->add($p1['suburb'], $p1['lon'], $p1['lat']);
			$tsp->add($p2['suburb'], $p2['lon'], $p2['lat']);
		}
		while($j < CYCLE_LIMIT && $i < count($suburbs)) {
			$tsp->add($suburbs[$i]['suburb'], $suburbs[$i]['lon'], $suburbs[$i]['lat']);
			$j++;
			$i++;
		}

		$tsp->compute();

		$p1 = get_stop($tsp->shortest_route()[0], $suburbs);
		$pos = count($tsp->shortest_route()) - 1;
		$p2 = get_stop($tsp->shortest_route()[$pos], $suburbs);

		unset($tsp);

	}
	return array($p1, $p2);
}

function is_between($p1, $p2, $test) {
	$return = false;
	$a = haversineGreatCircleDistance($p2['lat'], $p2['lon'], $test['lat'], $test['lon']);
	$b = haversineGreatCircleDistance($p1['lat'], $p1['lon'], $test['lat'], $test['lon']);
	$c = haversineGreatCircleDistance($p1['lat'], $p1['lon'], $p2['lat'], $p2['lon']);
	if($a * $b != 0) {
		$angleC = acos((pow($a,2) + pow($b,2) - pow($c,2))/(2 * $a * $b));
	} else {
		$angleC = 0;
	}
	if($angleC >= pi()/2) {
		return true;
	} else {
		return false;
	}
}

function list_ordered_stops($mode, $line) {
	$data_out = api_stops_list($mode, $line);
	$suburbs = find_suburbs($mode, $line, $data_out);
	list($p1, $p2) = suburb_endpoints($suburbs);
	$p1_key = array_keys($suburbs, $p1)[0];
	$p2_key = array_keys($suburbs, $p2)[0];
	unset($suburbs[$p1_key]);
	unset($suburbs[$p2_key]);
	$suburbs = array_values($suburbs);
	$suburbs = array_merge(array($p1), $suburbs, array($p2));

	$cbd = array('lat' => -37.8136, 'lon' => 144.9631);

	if(haversineGreatCircleDistance($p1['lat'], $p1['lon'], $cbd['lat'], $cbd['lon'] < haversineGreatCircleDistance($p2['lat'], $p2['lon'], $cbd['lat'], $cbd['lon']))) {
		$pt = $p1;
		$p1 = $p2;
		$p2 = $pt;
	}
	$ordered = order_suburbs($p1, $p2, $suburbs);

	return $ordered;
}

Evaluation

However, because this is an approximating algorithm, based on suburbs, there are some pitfalls. For my train line however, it does pretty well:

hurstbridgeAnd, tram routes are fairly similar:

55_tram

But, the 901 bus – The White Whale – is a bit skewiff:

901_1The reason for this is that it makes sense for the 901 bus to terminate at Melbourne Airport – a central hub of that region – but not at one of its most extreme points, Roxburgh Park. The TSP algorithm only know latitudes and longitudes, and so nominates Frankston and Roxy as likely endpoints, and starts from that assumption. Even when correcting for these aberrant initial conditions, things aren't 100% peachy (although there's certainly an improvement on the number of cross-metropolitan links):901_2

Particularly, this is messy:901_3Furthermore, the stops within a suburb aren't yet ordered, although I plan to implement this some time in the next fortnight or so. However, given that it takes only a few seconds to go from a standing start to a usually accurate listing the order of suburbs hit, I'm happy with the output of this algorithm to date.

One Comment

  1. JB

    This is too much fun!. This was a good read. I don't get into the coding. I might be stating something obvious, but have you tried calculating the series of nearest neighbours. Start at any point and then string them together? Once a station has a nearest neighbour, they get taken out of subsequent calculations.
    I see some pitfalls, but...