TOP: ¿dentro o fuera de la vista? (y III): Mejor usar SAMPLE.

Para terminar con la saga que hemos venido desarrollando aquí y aquí vamos a ver que muchas veces el uso de ‘TOP’ es incorrecto o innecesario.

Esto es así porque a menudo lo que se anda buscando en sentencias con ‘TOP’ es un número de filas elegidas aleatoriamente, y aquí radica la gran diferencia: ‘TOP’ lleva implícitamente asociado el concepto de ORDEN. Teradata provee de la cláusula ‘SAMPLE’, la cual permite elegir una muestra (aleatoria) de las filas. Este ‘SAMPLE’, a diferencia de ‘TOP’, no lleva consigo ningún concepto de ordenación.

Así pues:

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
SELECT COUNT(1)
  FROM MY_DB.TABLA_GRANDE
;

 *** Query completed. One row found. One column returned.
 *** Total elapsed time was 11 seconds.

   Count(1)
-----------
   60279016

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
CREATE VIEW MY_DB.PRE_SAMPLE
AS
SELECT MY_COLUMN
  FROM MY_DB.TABLA_GRANDE
SAMPLE 10;

 *** View has been created.
 *** Total elapsed time was 1 second.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
SELECT *
  FROM MY_DB.PRE_SAMPLE
;

 *** Query completed. 10 rows found. One column returned.
 *** Total elapsed time was 1 second.

MY_COLUMN
---------
B15273942
C1A283144
91625314D
V10253646
M15273145
N1A223041
91423354G
91829374B
B1A203242
B15223845

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
CREATE VIEW MY_DB.POST_SAMPLE
AS
SELECT MY_COLUMN
  FROM MY_DB.TABLA_GRANDE
;

 *** View has been created.
 *** Total elapsed time was 1 second.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
SELECT *
  FROM MY_DB.POST_SAMPLE
  SAMPLE 10
;

 *** Query completed. 10 rows found. One column returned.
 *** Total elapsed time was 1 second.

MY_COLUMN
---------
M10283744
C15223846
M14293146
S1A253646
M17233944
C1S253744
M15233947
O11203747
91329384G
H1U263349

Los ‘EXPLAIN’ son parecidos, aunque no iguales (la ‘materialización’ de la vista con SAMPLE supone una diferencia):

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
EXPLAIN
SELECT *
  FROM MY_DB.PRE_SAMPLE
;

 *** Help information returned. 20 rows.
 *** Total elapsed time was 1 second.

Explanation
--------------------------------------------------------------------------------
  1) First, we lock a distinct MY_DB."pseudo table" for read on a
     RowHash to prevent global deadlock for
     MY_DB.TABLA_GRANDE.
  2) Next, we lock MY_DB.TABLA_GRANDE in view
     PRE_SAMPLE for read.
  3) We do an all-AMPs SAMPLING step from
     MY_DB.TABLA_GRANDE in view PRE_SAMPLE by way
     of an all-rows scan with no residual conditions into Spool 1
     (all_amps), which is built locally on the AMPs.  Samples are
     specified as a number of rows.
  4) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by way of
     an all-rows scan into Spool 4 (group_amps), which is built locally
     on the AMPs.  The result spool file will not be cached in memory.
     The size of Spool 4 is estimated with low confidence to be
     60,196,840 rows (1,805,905,200 bytes).  The estimated time for
     this step is 12.40 seconds.
  5) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 4 are sent back to the user as the result of
     statement 1.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
EXPLAIN
SELECT *
  FROM MY_DB.POST_SAMPLE
  SAMPLE 10
;

 *** Help information returned. 14 rows.
 *** Total elapsed time was 1 second.

Explanation
--------------------------------------------------------------------------------
  1) First, we lock a distinct MY_DB."pseudo table" for read on a
     RowHash to prevent global deadlock for
     MY_DB.TABLA_GRANDE.
  2) Next, we lock MY_DB.TABLA_GRANDE in view
     POST_SAMPLE for read.
  3) We do an all-AMPs SAMPLING step from
     MY_DB.TABLA_GRANDE in view POST_SAMPLE by way
     of an all-rows scan with no residual conditions into Spool 2
     (group_amps), which is built locally on the AMPs.  Samples are
     specified as a number of rows.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 2 are sent back to the user as the result of
     statement 1.

Como vemos en el ejemplo, el funcionamiento de las dos ‘SELECT’ (con el ‘SAMPLE’ dentro o fuera de la vista) es bastante parecido en rendimiento (aunque no igual, como se ve en los ‘EXPLAIN’). Y mucho mejor que con ‘TOP’, siempre en el caso de que lo único que queramos sea sacar un número de filas aleatoriamente de una tabla sin preocuparnos de otros factores, como el orden o ‘ranking’ de las filas en cuestión.

Saludos.

Carlos.

1 Responses to TOP: ¿dentro o fuera de la vista? (y III): Mejor usar SAMPLE.

  1. Sí que ha llegado lejos en análisis. ¡Muchas gracias por el trabajo!

Deja un comentario