How to do random sampling of rows in Synapse Analytics serverless SQL pool?
Is there any way in SQL to take a random sample of N rows (or M% if necessary) rows from a linked external data source, using the Azure Synapse Analytics serverless SQL pool?
Cryptographic functions are not available in the serverless SQL pool, so basically I can’t use RAND()
or CHECKSUM(NEWID())
, e.g. to define a condition to filter rows by comparing to my desired value of N (or M):
FUNCTION ‘RAND’ is not supported.
I can use HASHBYTES
against a field in my external data source, but my table doesn’t contain a unique ID per row.
SELECT title, director, HASHBYTES('sha2_256', title)
FROM external_table
Ideally the sampling should be statistically sound, rather than relying on some implementation detail of the database (like TABLESAMPLE
would, which anyway is also not available in the serverless pool).