Hi.
I am building a system for monitoring data quality.
It builds a daily history of all QVDs saved in the prod environment.
It gets much of the table info from metadata tables:
- field counts, unique values, row counts.
But I would like some more direct measures as well:
- for numeric columns: median, min, max, average, stdev, outliers
- for categorical columns: distribution
The direct measures are fairly CPU-intensive, since the data is fairly large.
I am thinking of doing the calculations when the table is created, so the data does not have to be processed a second time just for the measures.
What I am doing now is looping over the columns and calculating all measures on every column. It takes its toll on the systems.
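To illustrate the kind of alternative I have in mind, here is a rough sketch that makes one pass over the rows and updates every measure as it goes, instead of scanning the data once per column. It is written in Python purely for illustration and assumes the rows arrive as dictionaries; the names `NumericProfile` and `profile_rows` are made up. Median and outliers are left out, since they need a sort or some approximate method.

```python
import math
from collections import Counter


class NumericProfile:
    """Running count, min, max, mean and sample stdev (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def stdev(self):
        # Sample standard deviation; defined as 0.0 for fewer than two values.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def profile_rows(rows, numeric_cols, categorical_cols):
    """Profile all listed columns in a single pass over the rows."""
    numeric = {c: NumericProfile() for c in numeric_cols}
    categorical = {c: Counter() for c in categorical_cols}
    for row in rows:
        for c in numeric_cols:
            value = row.get(c)
            if value is not None:  # nulls are skipped for numeric measures
                numeric[c].add(float(value))
        for c in categorical_cols:
            categorical[c][row.get(c)] += 1  # nulls are counted under None
    return numeric, categorical
```

Calling `profile_rows(rows, ["amount"], ["region"])` would return one `NumericProfile` per numeric column and one `Counter` per categorical column, which could then be written to the daily history.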
Are there any smart hacks, or any technical details and maneuvers I can use to make it easier?
BR Lasse