Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
MrDolph
Contributor
Contributor

Generating 98 entries for each hospital in a sheet and predict/generate their behavior based on current and past trends

MrDolph_0-1684794787823.pngMrDolph_1-1684794803377.pngMrDolph_2-1684794815751.png

i have a 2 billion rows sheet that display the activity of over 500 hospitals and their respective reports of activity across 8 years and 2 months ranging from jan 2015 to feb 2023, most of the hospitals don't have reports for every single month, some even going as far as yielding a single report, a single month reported in the entire time span, i'm working with a sheet that has the CNES (identification number), UF (State), Cidade (City), Nome (Name), PR (procedure done), QTS (amount of times said procedure was done) Month, Year and a date format that i just concatenated in the script form to join yyyy / mm

i'd want to generate the missing months for every PR entry for every hospital (so that'd probably multiply the size of the sheet by quite a lot) so that every distinct procedure done from every distinct hospital with the 98 months and fill each of that months entry with the average value for that procedure for that hospital for that month/year

my primary goal is to find a trend and fill out most of the gaps, ideally, if a hospital has reported consistently for half of the time, even if past action doesn't guarantee future patterns, i'd want to find the average and maybe incorporate a standard deviation in the graph as a way to find trends

 

i have created a simple mastercalendar that covers all the 98 months and i'd like to start generating all the entries for PRs' months

tldr: a way to analyze how many PRs per month a hospital reports and plot out their most likely behavior on the months afterward based on previous behavior and general market behavior (that being the average of the entire dataset)

Labels (3)
3 Replies
marcus_sommer

I think there are mainly two approaches to populate missing data. One would be to join the appropriate dimensions to create a cartesian product of them. In your case it might be something like:

t. load distinct YearMonth from Calendar; join(t) load Hospital from HospitalDim;

resulting in about 50 k of records. Afterwards you could map the aggregated values from your facts to it whereby within the most scenarios the map-default value would be set to zero or null(). In your case by the aim of interpolating any min/max/avg values from another periods or similar hospitals you would need some next steps. This might be including further mappings derived from the above mentioned aggregations, for example could applymap() be nested several times so that the non-matching default-value is the next mapping.

Maybe some other aggregations are created and mapped/joined. Another way to fill missing values within the cartesian table could be to load the table resident again with an appropriate order by and checking with interrecord-functions like previous() and peek() the surrounding records and manipulating them like wanted.

The other approach of populating records would be to use exists() to add the non exists periods. This may look like:

t: load YearMonth & '|' & Hospital as Key, Value from X;
    concatenate(t)
    load YearMonth & '|' & Hospital, 0 as Value from Cartesian
    where not exists(Key, YearMonth & '|' & Hospital);

and afterwards also filling the missing values.

Which way might be the most suitable one may depend on various things but like here hinted I wouldn't tend to populate the missing data within the facts of 2 billions records else within an aggregated and associated table.

MrDolph
Contributor
Contributor
Author

in attaching the calendar to the procedures done (PR) how would i go about firstly: copying the existing fields onto the other hospitals:

Hospital Procedure done (PR) QTS Date
A A 1 Jan
A B 1 Jan
A C 1 Jan
A D 1 Jan
A A 1 Feb
A B 1 Feb
A C 1 Feb
A B 1 Mar
A D 1 Mar

 

as shown, like this one, pretty much every hospital has different reports for each month (in total 3713 distinct PRs). i wanted to look at all the procedures a given hospital has done and replicate all of them for each and every month (so yes, add for each hospital a multiple of all the PRs it has ever listed on top of assigning all the 98 months to each and every PR for each and every Hospital)

would that be too powerful to compile or could that be done? (current file has 2 billion rows) 

marcus_sommer

If I understand you right you need not only the combination of all hospitals and periods else to consider the PR, too. In the worst case it could mean 50.000 × 3.700 = 185 M of records but just taking the distinct combination of the hospital and PR for the cartesian should be much lesser. Maybe 20 in the average and resulting in about 1 M of records.

In this case I would think it should be further working and continue. In general all provided facts are no showstopper else it depends mainly on the available resources if its running or not. Nevertheless I suggest to split the task maybe for years and storing everything (facts, aggregations, populated data, ...) in qvds and the future data are then added per incremental logic.