SQL for Statisticians

Ingrid Aulike

I wanted to practice writing SQL and decided the data in R’s “nycflights13” package would be suitable. At a large airport, planes arrive and depart at all hours. My SQL query finds the number of aircraft that return to the airport within a certain time, after leaving during a specified period. My query finds the number of aircraft that left on the 1st of July 2019 and returned within 3 hours. Later I hope to generalise the query so users can select the period in which aircraft leave and the length of time until they return.

Here’s a toy dataset. SQL doesn’t label the table, but it is called “flights”.

If I’d just wanted to know which of the flights that arrived on the 1st of July 2019 departed within three hours, the code is simple.

But my query was to find the next arrival time after an aircraft’s departure. An individual record does not contain this information. If the data was sorted into order of arrival time within flight_ID, then what I needed was the arrival time from the following record in the sorted order, provided the following record in the sorted order was the same aircraft. SQL, like SAS, does not let you access individual values in a column. I considered using SQL’s LEAD() or LAG() but I’ve concluded that it’s not possible to use these within groups. In the end I realised that I needed to do a self join of the flights table to itself, which is what makes this code more than beginner stuff.

Here is my first attempt at the SQL query:

I joined the table to itself with “from flights as A, join flights as B”, where the connection along each row in the joined table is only among records with the same flight_ID, and where the departure time of a flight is before a subsequent arrival time of the flight, and where the date of departure is the 1st of July 2019, and where the difference between the arrival and departure times is less than 3 hours. This query gives the following table:

I was reasonably happy to keep some column headers as names from the original table, but “B.arrive_datetime” has an alias, assigned with “AS arrive_again_datetime”. The other columns with aliases are ID_A and ID_B. I kept these to help with debugging.

You can see that the table captures exactly the problem I intended it to when I made up the toy data: the problem of an aircraft, flight_ID = 4, re-arriving multiple times in the three hour period. My query has found, for flight_ID=4, every future flight arrival time within 3 hours after the 10:30pm departure and after the 11:30pm departure as well! For a particular aircraft’s particular arrival time, I actually only want the next arrival time after the first departure, not all future arrival times.

SQL uses GROUP BY to specify groups. The groups are formed from the individual rows which are still in the table after WHERE filters out unwanted records. In other words, GROUP BY does its work not on the 11 row flights table, but only on the 8 row table that has been subjected to elimination of rows with the WHERE clause. You can see in my next query below, the groups are made up of rows with common flight_ID and initial arrival time. In my query, I request the row with the earliest future arrival time, as shown in the updated SELECT statement below, with “MIN(B.arrive_datetime)”. The GROUP BY has told the MIN what groups to minimise the arrival time variable over, and MIN is finding the earliest time in each group.

I’ve given the minimum/earliest re-arrival time an alias, “AS next_arrive_datetime”, which will be an informative column heading in the output table.

So here we have the correct table:

For each flight_ID/arrive_datetime combination, only the earliest re-arrival time is listed.

The basic SQL syntax is SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY. HAVING is the only one not in my query. HAVING puts a condition on aggregates of the data, where the aggregates (like MIN, MAX, SUM, AVG, COUNT) work on the rows in each group defined by GROUP BY. Don’t put row constraints in a HAVING. Row constraints go in WHERE.

Here’s the end result of my code, with some tidying up. You might have been asked for just the number of flights that come back to the airport after leaving, but obviously you want to have a bit of an idea that your SQL query is doing what you think it’s doing before you create a table in order to count the rows. In the code below, you can see tables created, taking care they don’t already exist, and dropped. The development of the query is shown above. Once you’re happy, you create a table with the query in it as shown. The new table is “flights_query”. You can count the rows, or the request might be to just count the individual aircraft.

Here’s the count output:

Next I might work out how to do this in Power BI with DAX. Then, in Power BI, where the underlying data can be secured, I will make a dashboard/report where users can specify the time periods for leaving and returning.