Hi, I need some help automating ETL code to run on multiple files in a folder, one at a time.
I have over a year's worth of data broken down into monthly files. We've been building these files over the year, and the script has occasionally changed, causing some of the files to have 2 extra fields. I need all the files to have these extra fields. A file either contains both fields or neither, and the fields are just a calculation from existing fields, so no extra data is needed. The monthly dated file structure needs to remain the same – this script is just to add the fields where necessary.
This is across ~90 files with different naming conventions and fields, but the two I am trying to add are consistent across all files. All are in one folder.
Filename syntax eg.
AgentSourceData_20141101.qvd
AgentSourceData_20141201.qvd
TeamSourceData_20151101.qvd
TeamSourceData_20151201.qvd
CentreSourceData_20151201.qvd
I need to be able to run a fairly automated script that:
Imports a file,
Checks whether it has the extra columns,
If it does, discards the data and imports the next file,
If it doesn't, adds the columns,
Exports to the same filename, overwriting the existing file,
Discards the data,
Imports the next file,
etc.
I’ve written the code below which works, but obviously only on one specific file at a time. I need to know how to make it loop through all the files in the folder without me having to specify each one individually.
Test1:
Load *
FROM [AgentSourceData_20151101.qvd](qvd);
IF FieldNumber('DateYear','Test1')=0 then
Test2:
Load *
,Year(Date) as "DateYear"
,Month(Date) as "DateMonth"
Resident Test1;
Store Test2 into [AgentSourceData_20151101.qvd];
Drop Table Test2;
Drop table Test1;
Else
Drop table Test1;
Endif;
Test1:
Load *
FROM [AgentSourceData_20151201.qvd](qvd);
etc...
Hope that makes sense.
Cheers,
Dave
For such things you could use FileList() to loop through your folders and then run your checks on things like filetime/filesize or the contained fields. For the field check it's enough to read only one record with a First statement, and for QVD files you could also use the file functions QvdFieldName() and QvdNoOfFields(). If you search for filelist you will find many examples, like this one: Re: Pick the latest excel file from folders
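A minimal sketch of this approach – checking a QVD's field list from the file header alone, without loading any data. The folder variable and the TRACE message are placeholders; you would put the load-and-store logic inside the inner IF:

```
// Placeholder path – adjust to your environment.
SET vFolder = C:\SourceData\;

FOR EACH vFile IN FileList('$(vFolder)*.qvd')

    // Scan the QVD header for the field; no records are loaded.
    LET vHasDateYear = 0;
    FOR i = 1 TO QvdNoOfFields('$(vFile)')
        IF QvdFieldName('$(vFile)', $(i)) = 'DateYear' THEN
            LET vHasDateYear = 1;
            EXIT FOR
        ENDIF
    NEXT i

    IF vHasDateYear = 0 THEN
        // Field missing – load, extend and store back here.
        TRACE $(vFile) is missing DateYear;
    ENDIF

NEXT vFile
```

Note this only tells you whether the field exists in the header, not whether it contains data.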
- Marcus
You want to load all the QVDs in a folder and create the Year & Month from the file name?
Thanks Marcus.
This was my finalised working script. I had to add some additional logic using the LEN function to look inside the columns where they existed and recreate them if they held no data, as we had some files with the column headings but no data.
This is working, but any suggestions for optimisation / improvement are welcome.
For each vFileName in Filelist('$(SDL_Directory)*')

    Test1:
    Load *
    From [$(vFileName)] (qvd);

    If FieldNumber('DateYear', 'Test1') = 0 then
        // Columns missing entirely – derive and add them.
        Test2:
        Load *,
            Year(Date) as "DateYear",
            Month(Date) as "DateMonth"
        Resident Test1;
        Store Test2 into [$(vFileName)];
        Drop Table Test2;
        Drop Table Test1;
    Else
        If Len(Peek('DateYear', 0, 'Test1')) = 0 then
            // Columns exist but are empty – drop and recreate them.
            Drop Fields DateYear, DateMonth From Test1;
            Test2:
            Load *,
                Year(Date) as "DateYear",
                Month(Date) as "DateMonth"
            Resident Test1;
            Store Test2 into [$(vFileName)];
            Drop Table Test2;
            Drop Table Test1;
        Else
            // Columns present and populated – nothing to do.
            Drop Table Test1;
        Endif;
    Endif;

Next vFileName;
Cheers,
Dave
A possible optimization might be to load only the first record into Test1:
Test1:
First 1 Load ...
and then have the second load, Test2, read the full data again from the QVD. But you will need larger datasets to see noticeable differences here.
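Spelled out, the suggestion looks roughly like this – a sketch only, reusing the vFileName variable and table names from the script above. The one-record Test1 serves purely as the check; the full file is read only when the fields actually need adding:

```
Test1:
First 1 Load *
From [$(vFileName)] (qvd);

// Fields are needed if missing, or present but empty.
LET vNeedsFields = FieldNumber('DateYear', 'Test1') = 0
                OR Len(Peek('DateYear', 0, 'Test1')) = 0;
Drop Table Test1;

If vNeedsFields then
    Test2:
    Load *
    From [$(vFileName)] (qvd);

    // Remove any empty placeholder columns before re-deriving them.
    If FieldNumber('DateYear', 'Test2') > 0 then
        Drop Fields DateYear, DateMonth From Test2;
    Endif;

    Test3:
    Load *,
        Year(Date) as "DateYear",
        Month(Date) as "DateMonth"
    Resident Test2;
    Drop Table Test2;

    Store Test3 into [$(vFileName)];
    Drop Table Test3;
Endif;
```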
But a bigger improvement could be to skip this task completely. It's often not necessary to hold month/year fields within a table if you have a date field which can later be connected to a master calendar providing all the needed period fields.
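The master-calendar approach could be sketched like this. "Facts" is a placeholder name for one of the loaded source tables, assumed to contain the Date field used in the script above; the calendar links to the fact table on Date and carries the period fields instead:

```
MasterCalendar:
// Top load: keep the date plus the derived period fields.
Load
    Date,
    Year(Date)  as DateYear,
    Month(Date) as DateMonth;
// Middle load: generate one row per day in the date range.
Load
    Date(MinDate + IterNo() - 1) as Date
While MinDate + IterNo() - 1 <= MaxDate;
// Bottom load: find the date range in the data.
Load
    Min(Date) as MinDate,
    Max(Date) as MaxDate
Resident Facts;
```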
- Marcus
Great advice on the first-record load. Some of these QVDs are huge, so that will save significant time.
Without going into the boring detail, the month/year columns are needed for a daily automated extraction script. When the files go through an ETL script once a week to bring them into our dashboard, these columns are indeed dropped and the date is linked to a master calendar.
Cheers again.