Dataloader error in the sample notebook

from datasets import Dataset
from torch.utils.data import DataLoader

from utils import CustomDataset
# create menu of slices from the 3D volumes
pretrain_dataset = CustomDataset(
    root_dir="./training_data",
    num_slices=300,  # number of slices in one direction of the cube; 300 for this dataset
    cheesecake_factory_mode=True,  # much like the menu at Cheesecake Factory, your slice menu will include everything :-)
    # limit=50,  # if cheesecake_factory_mode=False, limit must be an int value to limit the menu
    data_prefix="seismicCubes",  # what does the data name start with
    label_prefix="",  # what does the label name start with; leave label_prefix blank when pretraining a model.
    pretraining=True,
)


# this function will be used to iterate through all 2D slices created in the CustomDataset
def gen():
    for idx in range(len(pretrain_dataset)):
        yield pretrain_dataset[idx]


# using from_generator() we import the torch dataset to a HF dataset
hf_dataset = Dataset.from_generator(gen)

dataloader = DataLoader(hf_dataset, batch_size=16, num_workers=4)
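
One thing worth checking before handing the generator to `Dataset.from_generator()`: the HF builder infers the features from the first examples, so if a later slice comes back with a different shape, split generation aborts with a `DatasetGenerationError` like the one below. A minimal pre-flight check could look like this (a sketch only; `check_uniform_shapes` is a hypothetical helper, not part of `utils`):

```python
import numpy as np

def check_uniform_shapes(dataset):
    """Return the common item shape, or raise on the first item that differs.

    Hypothetical helper, not part of the starter utils.
    """
    expected = np.asarray(dataset[0]).shape
    for idx in range(1, len(dataset)):
        shape = np.asarray(dataset[idx]).shape
        if shape != expected:
            raise ValueError(f"item {idx} has shape {shape}, expected {expected}")
    return expected
```

Running this over `pretrain_dataset` before calling `from_generator()` pins down exactly which slice breaks the inferred schema, instead of failing partway through generation.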

Error after generating 600 examples

Generating train split: 600 examples [00:00, 895.05 examples/s]

builder.py 1786 _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e

datasets.exceptions.DatasetGenerationError:
An error occurred while generating the dataset

Hi @faruanacabiano. These issues may be related to the shape issues encountered by @moruridarius. Please review this post for a way to troubleshoot.

ThinkOnward Team

It doesn’t seem to help.
It would be great if you could provide a sample notebook and a utils file that work out of the box.
Thank you and looking forward!

Hi @faruanacabiano. Our apologies for the issue. The team is working diligently to identify and resolve the issue. We’ll report back here when we have a solution.

ThinkOnward Team

Thank you for your prompt response!

I managed to get something to work with this snippet.

        else:
            data = np.load(
                self.slice_menu["data"][idx], allow_pickle=True, mmap_mode="r+"
            )
            if data.shape[1] == 1259:
                data = data.transpose(0, 2, 1)  # -> (300, 300, 1259)
            if data.shape[0] == 1259:
                data = data.transpose(2, 1, 0)  # -> (300, 300, 1259)
            assert data.shape == (300, 300, 1259), f"Unexpected shape: {data.shape}"

            if self.slice_menu["axis"][idx] == "i":
                data = data[int(self.slice_menu["idx"][idx]), ...]
            else:
                data = data[:, int(self.slice_menu["idx"][idx]), :]

            data = rescale_volume(data)
            data = data[np.newaxis, :, :]
            data = np.repeat(data, 3, axis=0)

            data = torch.from_numpy(data).long()

            return data
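
The pair of transpose checks above can be collapsed with `np.moveaxis`: find whichever axis holds the 1259 depth samples and move it last. This is just a sketch of the same idea (`normalize_axes` is my own helper name, not from the starter code), and note that it cannot tell the two 300-length lateral axes apart, which is exactly the ambiguity I raise below:

```python
import numpy as np

def normalize_axes(vol, depth=1259, lateral=300):
    """Move the depth axis (length 1259) to the end -> (300, 300, 1259).

    Hypothetical helper; cannot disambiguate the two lateral axes.
    """
    vol = np.moveaxis(vol, vol.shape.index(depth), -1)
    assert vol.shape == (lateral, lateral, depth), f"Unexpected shape: {vol.shape}"
    return vol
```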

But, I am still unsure if this is the right approach.
In particular, if len(a_pure) = len(a_noised) = 300, the correct orientation could be either a = a_noised or a = a_noised[::-1].

The model will learn a lot of wrong things if we make the wrong choice here. I am already seeing weird spikes.
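
For what it's worth, when a paired pure volume is available you can at least detect a flipped orientation empirically: the noised slice should correlate far better with the pure one in the correct orientation than in the reversed one. A rough heuristic (my own sketch, not anything from the starter code):

```python
import numpy as np

def looks_flipped(a_pure, a_noised):
    """True if a_noised matches a_pure better after reversing the first axis.

    Hypothetical heuristic; assumes a_pure and a_noised are aligned arrays.
    """
    direct = np.corrcoef(a_pure.ravel(), a_noised.ravel())[0, 1]
    reverse = np.corrcoef(a_pure.ravel(), a_noised[::-1].ravel())[0, 1]
    return bool(reverse > direct)
```

This obviously only helps where a pure/noised pair exists; for unpaired volumes the orientation question still needs an answer from the organizers.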

Still no solution on this one? @discourse-admin

Hi @nadsoncarliva

We recommend the approach that @faruanacabiano posted above to verify that the volumes have the correct dimensions.

ThinkOnward Team

Hi, so what is the expected output shape from the CustomDataset? Is it (300, 300, 1259), as asserted by @faruanacabiano, or is it (3, 300, 300)?