Is the test data the same as the training data?

Hi,
I checked the data: the test set has 1440 rows with bldg_id 1-1440 and the training set has 7200 rows with bldg_id 1-7200. I compared the power consumption for a random sample of bldg_id values present in both sets, and it is exactly the same in the test and training data. For example, see the data below:


I am not sure whether I downloaded the wrong data or whether this is how it is supposed to be.
Please advise.

Thank you

Hi @ramdhan_aw,

While there may be a few identical buildings in both the test and training datasets, the majority are different. If you find that the energy consumption for all buildings in your test dataset matches those in your training dataset, you might be comparing the wrong data or have downloaded incorrect datasets.

Best of luck Classifying the Buildings!
ThinkOnward Team

Hi,
I have checked, and the test dataset matches the training dataset 100%. I compared them using mean squared error, which came out as zero, meaning they are identical. Would you mind checking whether the source data is correct? I downloaded it from the Data tab on the challenge webpage.
Below is a plot of the power consumption from train and test for the same building id.
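
For reference, here is a minimal sketch of the kind of mean squared error check described above. It assumes each building's power consumption has already been loaded into a plain list of floats; the function name and the sample values are illustrative placeholders, not challenge data.

```python
def mean_squared_error(a, b):
    """Mean of squared element-wise differences between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    if not a:
        raise ValueError("series must not be empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Placeholder series standing in for one building's train and test readings.
train_series = [1.0, 2.0, 3.0]
test_series = [1.0, 2.0, 3.0]

# An MSE of exactly zero means every pair of values is identical.
print(mean_squared_error(train_series, test_series))
```

A result of zero only shows that the two series passed to the function are identical, so it is worth confirming that the two inputs were really read from different folders.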

Hi @discourse-admin, any clarification on this?

Hi @ramdhan_aw

Can you check that you are indeed comparing the test files against the train files? The train file is 2.76 GB and the test file is ~500 MB, so judging by file size alone they cannot be exactly the same. We have downloaded and checked both datasets, and while there may be a few identical buildings in both the test and training datasets, most are different.

ThinkOnward Team

Hi @discourse-admin,
I believe the downloaded files are correct; the sizes are the same as you mentioned. The training folder has files named after the building id, from 1 to 7200, and the test folder has files from 1 to 1440. If we compare the same building_id in test and train (for building ids 1 to 1440), the power consumption is exactly the same. Would you mind checking again? I believe looking at the data content would be better than comparing file sizes.

Hi @ramdhan_aw

We have downloaded both the train and test datasets and checked them by comparing the energy consumption in all the files from the two datasets. We found no matching files between the two datasets; they are completely separate. Are there any other diagnostics that you can post to help troubleshoot this issue?
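
One diagnostic that could be posted is a byte-level comparison of the two folders. The sketch below is only an illustration under assumptions: `compare_folders`, `file_digest`, and the folder paths are hypothetical names, and it treats two files as matching only when their raw bytes are identical, so files holding the same readings with different metadata would not be flagged.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_folders(train_dir, test_dir):
    """Report byte-identical files between two folders.

    Returns (same_name, any_name):
      same_name: file names present in both folders with identical bytes
      any_name:  (test_name, train_name) pairs with identical bytes
    """
    train = {p.name: file_digest(p) for p in Path(train_dir).iterdir() if p.is_file()}
    test = {p.name: file_digest(p) for p in Path(test_dir).iterdir() if p.is_file()}

    same_name = sorted(n for n in test if train.get(n) == test[n])

    # Index train files by digest to catch matches stored under different names.
    by_digest = {}
    for name, digest in train.items():
        by_digest.setdefault(digest, []).append(name)
    any_name = sorted(
        (t, tr) for t, digest in test.items() for tr in by_digest.get(digest, [])
    )
    return same_name, any_name
```

Calling `compare_folders("path/to/train", "path/to/test")` with the actual download locations and posting the lengths of the two returned lists, together with the resolved folder paths, would show whether both sides of the comparison are reading the same files.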

ThinkOnward Team