Tuesday 19 November 2013

finding flight legs - originally from Jonathan Gennick and SQL Pocket Guide

Recently I was studying SQL so that I could manage my customers data and offer better service.

I stumbled across the "findin flight legs" case study. However I could not fin the web page, so here is a copy from the wayback machine..


Original 

Jonathan Gennick

Writer * Book Editor * Oracle DBA * SQL & PL/SQL Developer
Father * Husband * Son * Mountain Biker * EMT

Finding Flight Legs

In response to a call for interesting applications of union queries, Nuno Souto sent me a query that applied INTERSECT to the results of two CONNECT BY queries as part of a solution to determine possible flight paths from one point to another. I was attracted to Nuno's solution, because I'd recently written some articles on CONNECT BY, and Nuno's use of it didn't fit the canonical parts-explosion pattern so often used in examples of recursive queries.

The Background

Imagine with me that you're in the package transportation business. Your company manages a large number of planes that fly a variety of flights paths between airports. A single plane will stop at one or more destinations. For example, on a given day you might have:
  • Plane QF1 that flies from A to B, and then on to Z.
  • Plane QF2 that flies from A to Z directly, and then stops.
  • Plane QF3 that flies from D to E to F, and finally to Z.
Each hop from one point to another is called a "leg". For example, flight QF1 consists of two legs: from A to B, and from B to Z. Flight QF2 consists of a single leg, whereas Flight QF3 consists of three legs. Each day's flights might be different from the previous day's. To keep all your planes moving efficiently, you use a computerized scheduling application that generates flight schedules several months into the future, storing the flight legs in the following table (example data at end of article):
create table flegs (
   flt varchar2(10),    /* flight name */
   loc_st varchar2(10), /* leg start */
   loc_en varchar2(10), /* leg end */
   tim_st date,         /* departure time */
   tim_en date          /* arrival time */
   );

Each flight is a trip made by a single plane consisting of one or more legs. Each leg has starting and ending location, and also a starting (departure) and ending (arrival) time. Of course in a real system you'd have additional columns, primary keys, and foreign keys to tables with other information. To make the technique described in this article easy to understand, Nuno and I have simplified the example to its essential elements.

The Problem

Here's the problem. A customer has a package that you need to ship from A to Z on July 30, 2003. You won't receive the package from your customer until 0600 (6:00 AM), and you must have the package at its destination by 1500 (3:00PM). You need to find the possible flight solutions for travel from A to Z in that time frame. This isn't as simple as looking for flights that begin at A and end at Z. You must also find flight solutions that begin at A and that stop at one or more other destinations on their way to Z. Notice I said "flight solutions". You must consider the possibility of switching planes (i.e. switching flights) midstream. To solve this problem, Nuno applied the following approach:
  1. Find all flights leaving point A, and all subsequent legs of those flights, during the specified time period.
  2. Find all flights arriving at point Z, and all preceding legs of those flights, also during the specified time period.
  3. Take the intersection of #1 and #2.
The following sections walk through these steps in detail.

Where Can We Go From Point A?

Consider the flights departing from point A. Getting the first leg of each flight departing from point A is easy; you can simply check the loc_st column:
select flt,loc_st,loc_en,tim_st,tim_en
from flegs
where tim_st between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')
  and loc_st = 'A';
But this isn't good enough. You need more than just the first leg of each flight leaving point A, you need *every* leg. You need every leg so you can determine whether any of those flights eventually arrive at point Z. Nuno correctly recognized this as a recursive SQL problem:
select flt,loc_st,loc_en,tim_st,tim_en
from flegs
where tim_st between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')
connect by prior loc_en = loc_st
   and prior tim_en < tim_st
start with loc_st = 'A'
   and tim_st >= to_date('20030730 060000','yyyymmdd hh24miss');
The START WITH clause specifies A as the beginning point. The CONNECT BY clause links the beginning of each subsequent leg with the ending of the previous. Legs are linked based on both location and time. We're looking for flights into, and then out of, a location, and departing flights must always occur later in time than arriving flights. The results are as follows:
FLT        LOC_ST     LOC_EN     TIM_ST          TIM_EN
---------- ---------- ---------- --------------- ---------------
QF1        A          B          20030730 070000 20030730 100000
QF1        B          Z          20030730 110000 20030730 120000
QF2        A          Z          20030730 080000 20030730 120000
QF5        A          B          20030730 070000 20030730 090000
QF1        B          Z          20030730 110000 20030730 120000
QF3        B          D          20030730 100000 20030730 110000
QF3        D          E          20030730 120000 20030730 130000
QF3        E          F          20030730 140000 20030730 150000
QF6        A          G          20030730 080000 20030730 090000
QF6        G          H          20030730 100000 20030730 110000
We now know where we can go beginning from point A between 0600 and 1500 on 30-Jul-2003. Not all flights from A will get us to Z. QF6, for example, gets us to H, which is not our destination. I'll explain how Nuno winnowed these results down shortly. It's interesting to notice that one path to Z involves switching flights. The results show the following solution:
  1. Depart point A on QF5 at 0700, arriving into point B at 0900.
  2. Depart point B on QF1 at 1100, arriving into point Z at 1200.
The query does not examine flight number. It simply determines all possible paths beginning from point B.

"From Where Can We Arrive At Point Z?"

Nuno's next step was to look at all flights into point Z, also during the specified time period. The query is much the same as in the previous section, except it now recursively traverses the tree of flight legs in the reverse direction, working its way backwards in time:
select flt,loc_st,loc_en,tim_st,tim_en
from flegs
where tim_en between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')
connect by prior loc_st = loc_en
   and prior tim_st > tim_en
start with loc_en = 'Z'
   and tim_en <= to_date('20030730 150000','yyyymmdd hh24miss');
Following are the results:
FLT        LOC_ST     LOC_EN     TIM_ST          TIM_EN
---------- ---------- ---------- --------------- ---------------
QF1        B          Z          20030730 110000 20030730 120000
QF1        A          B          20030730 070000 20030730 100000
QF5        A          B          20030730 070000 20030730 090000
QF2        A          Z          20030730 080000 20030730 120000
QF7        O          Z          20030730 140000 20030730 150000
QF7        M          O          20030730 120000 20030730 130000

Again, not all flights arriving into Z pass through A on their way to Z. QF7 is a notable example, beginning at M and passing through O on the way to Z.

Finding Legs Common to Both A and Z

CONNECT BY queries have to start someplace. One of the things I found insightful about Nuno's solution is that he realized the need for two queries: one to fan out and follow flights radiating outwards from point A, and the other to fan out in reverse from point Z, tracing backwards the paths of arriving flights. Any chain of flight legs returned in common by these two queries represents a solution to the problem of getting from point A to point Z. For example, both queries returned legs from flight QF1, albeit in slightly different order:
Query radiating outwards from A:
QF1        A          B          20030730 070000 20030730 100000
QF1        B          Z          20030730 110000 20030730 120000

Query radiating inwards towards Z:
QF1        B          Z          20030730 110000 20030730 120000
QF1        A          B          20030730 070000 20030730 100000
You may need to think about this a bit, and perhaps run some of these queries yourself using the test data at the end of this article, but you should realize that the only way these two legs can possibly appear in both result sets is if they represent a solution for flying from A to Z during the specified time frame. Thus, the following INTERSECT query can be used to return all such flight legs:
select flt,loc_st,loc_en,tim_st,tim_en
from flegs
where tim_st between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')
connect by prior loc_en = loc_st
   and prior tim_en < tim_st
start with loc_st = 'A'
   and tim_st >= to_date('20030730 060000','yyyymmdd hh24miss')
INTERSECT
select flt,loc_st,loc_en,tim_st,tim_en
from flegs
where tim_en between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')
connect by prior loc_st = loc_en
   and prior tim_st > tim_en
start with loc_en = 'Z'
   and tim_en <= to_date('20030730 150000','yyyymmdd hh24miss');
For any set of flight legs *not* representing a solution for travelling from A to Z, there are only three possibilities:
  • The set of flight legs originates at A and does not pass through Z.
  • The set of flight legs terminates at Z, but does not pass through A on the way to Z.
  • The set of flight legs does not involve A or Z at all.
In the first case, the legs would be returned by the first query but not the second, and thus would be eliminated from the INTERSECT results. In the second case, the legs would be returned by the second query, but not the first, and likewise would be eliminated from the INTERSECT results. In the third case, the legs would not be returned by either query, and thus would not show in the INTERSECT results. Only legs forming part of solutions for getting from A to Z will be returned by the INTERSECT query. For example:
FLT        LOC_ST     LOC_EN     TIM_ST          TIM_EN
---------- ---------- ---------- --------------- ---------------
QF1        A          B          20030730 070000 20030730 100000
QF1        B          Z          20030730 110000 20030730 120000
QF2        A          Z          20030730 080000 20030730 120000
QF5        A          B          20030730 070000 20030730 090000
These results show three solutions for getting from A to Z:
  • Fly QF1 from A to B to Z.
  • Fly QF2 directly from A to Z.
  • Fly QF5 from A to B, and then fly QF1 from B to Z.
The first two solutions are obvious; the second less so. The INTERSECT query returns all legs that form part of an A to Z solution, but Nuno's application software still needs to, and does, do some work to determine just what the specific solutions are. The ordering that you see in these results, especially in the first two rows, is just happenstance. The application also needs to recognize that any given leg may form part of one solution, or multiple solutions.

What About the Time Period?

When I first examined Nuno's queries, I was a bit puzzled over the following parts of his WHERE clauses:
Query radiating outwards from A:
where tim_st between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')

Query radiating inwards towards Z:
where tim_en between
         to_date('20030730 060000','yyyymmdd hh24miss') and
         to_date('20030730 150000','yyyymmdd hh24miss')
These predicates are not strictly necessary for proper operation of the query. They are an optimization. Given appropriate indexes on the tim_st and tim_en columns, these predicates enable the database to eliminate many rows from consideration before getting to the INTERSECT operation. So how does the query constrain solutions to the given time period? The key lies in the START WITH clauses:
The first query starts with the beginning time:

start with loc_st = 'A'
   and tim_st >= to_date('20030730 060000','yyyymmdd hh24miss')

The second query starts with the ending time:

start with loc_en = 'Z'
   and tim_en <= to_date('20030730 150000','yyyymmdd hh24miss');

It's possible for a chain of flight legs to leave A during the specified time period, yet reach Z after the specified ending time. Those legs will be picked up by the first query, but not by the second. Likewise, it's possible for a chain of flight legs to arrive at Z during the specified time period, but depart from A too early. Such legs will be picked up by the second query, but not the first. Thus, spurious legs can appear in the result sets of each separate subquery, but only *solutions* will appear in both subqueries, and thus only solutions will be returned by the INTERSECT query. There's also the two time conditions in the CONNECT BY clauses of the subqueries:
   and prior tim_en < tim_st
   
and:

   and prior tim_st > tim_en
These conditions ensure that the queries link legs in a rational time sequence. It does no good, for example, to fly into B only to find that your next flight left before you arrived! These conditions also prevent what Nuno refers to as the "loop" problem: a series of legs might come back to the origin of a flight and then take off again in another direction, still under the same overall flight "name". Without a way of "serializing" the time of each leg, a query encountering such looping legs would fail with a "loop detected" error message.

Applications of Recursive SQL

A CONNECT BY query really represents a recursive SQL statement. If you're not familiar with the concept of recursive SQL, you might want to read my July article on the topic: Is Yours A UNION Shop? Recursive SQL, and especially CONNECT BY, is often presented and explained in the context of containership. You might have a bill-of-materials listing the parts making up a car, with each car part, such as an engine, being made up of other parts, such as pistons and rings, and each of those parts in turn being composed of yet more parts, and so forth on down to the smallest and simplest of parts such as nuts and bolts and springs. Nuno's solution demonstrates that CONNECT BY queries have wider application than the canonical bill-of-materials example. In the case flight legs, CONNECT BY linked each leg with the next on the path from one point to another. The recursion had nothing to do with containership, but rather had to do with a series: here's a leg, I have to find the next, but now I have another leg, so I have to find the next, but now I have yet another leg, so I have to find the next. Nuno's solution is creative and insightful. I enjoyed tearing it apart and examining in detail how all the pieces worked together towards the goal of finding options for flying from point A to point Z. Frankly, it's opened my eyes to a whole new class of problems that can be solved using CONNECT BY. I thank him for letting me write about it, and I encourage you to experiment with it for yourself.

Acknowledgments

Many thanks to Nuno Souto for the very interesting SQL solution described in this article. Nuno came up with it. All I did was write about it.

Example Data for the FLEGS Table

You can use the following INSERT statements to populate the FLEGS table created in the article. The data given here is the same data as used in this article's examples.
insert into flegs values('QF1','A','B',
                  to_date('20030730070000','yyyymmddhh24miss'),
                  to_date('20030730100000','yyyymmddhh24miss'));
insert into flegs values('QF1','B','Z',
                  to_date('20030730110000','yyyymmddhh24miss'),
                  to_date('20030730120000','yyyymmddhh24miss'));
insert into flegs values('QF2','A','Z',
                  to_date('20030730080000','yyyymmddhh24miss'),
                  to_date('20030730120000','yyyymmddhh24miss'));
insert into flegs values('QF3','B','D',
                  to_date('20030730100000','yyyymmddhh24miss'),
                  to_date('20030730110000','yyyymmddhh24miss'));
insert into flegs values('QF3','D','E',
                  to_date('20030730120000','yyyymmddhh24miss'),
                  to_date('20030730130000','yyyymmddhh24miss'));
insert into flegs values('QF3','E','F',
                  to_date('20030730140000','yyyymmddhh24miss'),
                  to_date('20030730150000','yyyymmddhh24miss'));
insert into flegs values('QF4','F','Z',
                  to_date('20030730160000','yyyymmddhh24miss'),
                  to_date('20030730170000','yyyymmddhh24miss'));
insert into flegs values('QF5','A','B',
                  to_date('20030730070000','yyyymmddhh24miss'),
                  to_date('20030730090000','yyyymmddhh24miss'));
insert into flegs values('QF6','A','G',
                  to_date('20030730080000','yyyymmddhh24miss'),
                  to_date('20030730090000','yyyymmddhh24miss'));
insert into flegs values('QF6','G','H',
                  to_date('20030730100000','yyyymmddhh24miss'),
                  to_date('20030730110000','yyyymmddhh24miss'));
insert into flegs values('QF7','M','O',
                  to_date('20030730120000','yyyymmddhh24miss'),
                  to_date('20030730130000','yyyymmddhh24miss'));
insert into flegs values('QF7','O','Z',
                  to_date('20030730140000','yyyymmddhh24miss'),
                  to_date('20030730150000','yyyymmddhh24miss'));
COMMIT;