acutils package

Submodules

acutils.file module

acutils.file.copyfile(src, dst)[source]

Copy a file from src to dst without checking anything.

Parameters:
  • src ((str)) – absolute path to the file that will be copied.

  • dst ((str)) – absolute path to the future copied file.

Return type:

None

acutils.file.copyfile_with_safety(src, dst)[source]

Copy a file from src to dst, checking that the src file and the dst directory exist.

Parameters:
  • src ((str)) – absolute path to the file that will be copied.

  • dst ((str)) – absolute path to the future copied file.

Return type:

None
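The difference between the unchecked and the checked copy can be sketched as follows. This is a minimal illustration and not the acutils source: the name safe_copy_sketch and the exceptions raised are assumptions, since the documentation above does not say how a failed check is reported.

```python
import os
import shutil


def safe_copy_sketch(src, dst):
    """Hypothetical checked copy; not the acutils implementation."""
    # Assumption: a missing source file is reported with an exception.
    if not os.path.isfile(src):
        raise FileNotFoundError(f"source file not found: {src}")
    # Assumption: the destination directory must already exist.
    dstdir = os.path.dirname(dst)
    if not os.path.isdir(dstdir):
        raise NotADirectoryError(f"destination directory not found: {dstdir}")
    shutil.copyfile(src, dst)
```

The unchecked variant would simply be the final shutil.copyfile call on its own.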

acutils.file.copyfiles_to_dir(srcs, dstdir, newfilename=None, safecopy=True)[source]

Copy multiple files from srcs to a dst directory using the “tmnt_copyfile_to_dir” function.

Parameters:
  • srcs ((iterable of str)) – absolute paths to the files that will be copied.

  • dstdir ((str)) – absolute path to the directory that should contain the copy.

  • newfilename=None ((str)) – filename expected, if None src filename is used.

  • safecopy=True ((bool)) – if True copyfile_with_safety is called, else copyfile is called.

Return type:

None

acutils.file.load_dict_from_json(src)[source]

Load a dictionary from a json file.

Parameters:

src ((str)) – absolute path to the json file.

Returns:

loaded data dictionary.

Return type:

(dict<str;str>) data

acutils.file.reset_directory(dirpath, subs=[])[source]

Delete the directory if it exists, then create it again and fill it with empty subdirectories named from the subs argument.

Parameters:
  • dirpath ((str)) – absolute path to the directory to reset.

  • subs ((list<str>)) – name of the subdirectories that should be in dirpath.

Return type:

None
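The delete-then-recreate behaviour described above can be sketched with the standard library. The function name reset_directory_sketch is hypothetical, and the sketch uses a None default for subs instead of the documented subs=[] to avoid sharing one mutable list between calls.

```python
import os
import shutil


def reset_directory_sketch(dirpath, subs=None):
    """Hypothetical illustration of resetting a directory."""
    subs = [] if subs is None else subs
    # Remove the directory and everything inside it, if present.
    if os.path.isdir(dirpath):
        shutil.rmtree(dirpath)
    # Recreate it empty, then add one empty subdirectory per name.
    os.makedirs(dirpath)
    for name in subs:
        os.makedirs(os.path.join(dirpath, name))
```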

acutils.file.save_dict_as_json(dst, data)[source]

Save a dictionary as a json file.

Parameters:
  • dst ((str)) – absolute path to the new json file.

  • data ((dict<str;str>)) – data dictionary to save.

Return type:

None
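A round trip through save_dict_as_json and load_dict_from_json presumably behaves like the sketch below. The function names ending in _sketch are hypothetical, and the encoding and indentation choices are assumptions rather than documented behaviour.

```python
import json


def save_dict_as_json_sketch(dst, data):
    """Write a filename-to-label dictionary to a JSON file."""
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


def load_dict_from_json_sketch(src):
    """Read the dictionary back from the JSON file."""
    with open(src, "r", encoding="utf-8") as f:
        return json.load(f)
```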

acutils.file.tmnt_copyfile_to_dir(src, dstdir, newfilename=None, safecopy=True)[source]

Copy a file from src to a dst directory.

Parameters:
  • src ((str)) – absolute path to the file that will be copied.

  • dstdir ((str)) – absolute path to the directory that should contain the copy.

  • newfilename=None ((str)) – filename expected, if None src filename is used.

  • safecopy=True ((bool)) – if True copyfile_with_safety is called, else copyfile is called.

Return type:

None
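The naming rule for newfilename can be illustrated with a short sketch. The function copy_to_dir_sketch is hypothetical and always performs an unchecked copy. It returns the destination path only to make the result easy to inspect, whereas the documented function returns None.

```python
import os
import shutil


def copy_to_dir_sketch(src, dstdir, newfilename=None):
    """Hypothetical illustration of the documented naming rule."""
    # If no new filename is given, reuse the source file's name.
    filename = os.path.basename(src) if newfilename is None else newfilename
    dst = os.path.join(dstdir, filename)
    shutil.copyfile(src, dst)
    return dst
```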

acutils.file.tmnt_generate_documentation(src, dstdir)[source]

Extract documentation inside python file and fill it into markdown file.

Parameters:
  • src ((str)) – absolute path to the python file.

  • dstdir ((str)) – absolute path to the new markdown file.

Return type:

None

acutils.gpu module

acutils.gpu.cupy_to_numpy(arr)[source]

Return a numpy.array from a cupy.array or a numpy.array.

Parameters:

arr ((numpy.array or cupy.array)) – array to convert to a numpy one (if not already).

Returns:

converted array.

Return type:

(numpy.array) convarr

acutils.gpu.select_device(device)[source]

Select device used for some GPU computations of the current process.

Parameters:

device ((int or None)) – selected GPU (if None, does nothing).

Return type:

None

acutils.gpu.set_gpu_computation(activate=True)[source]

Enable or disable GPU computation. Enabling it requires the cupy and cucim modules; it changes the modules imported as auski and aunp.

Parameters:

activate=True ((bool)) – whether to activate GPU computation.

Return type:

None

acutils.handler module

class acutils.handler.DataHandler(datapath, file_extensions=None, allowed_cpus=1, seed=871, str_ndarray_dtype='U256')[source]

Bases: object

Main class to handle data on disk.

(str) datapath

Absolute path to the directory that contains source files.

(array/list like of str) file_extensions=None

Source file allowed extensions.

(int) allowed_cpus=1

Maximum amount of CPUs used to compute.

(int) seed=871

Seed used to initialize numpy randomizer.

(str) str_ndarray_dtype="U256"

Data type used for any string numpy arrays. It defines the maximum length of strings, especially those in sheet files, for loading labels and groups.

(numpy.array<str_ndarray_dtype>) files=None

The filenames (with extension) of the files to load.

(numpy.array<str_ndarray_dtype>) labels=None

The labels of the files.

(numpy.array<str_ndarray_dtype>) unique_labels=None

The unique labels among the labels.

(numpy.array<str_ndarray_dtype>) groups=None

The groups of the files; all files with the same group will be in the same dataset split.

balance_datasets(tdata, vdata)[source]

Balance datasets so the amount of data is equal for each label. This is basically calling “_balance_dataset” method with tdata then vdata.

Parameters:
  • tdata ((dict<str;str>)) – Train dictionary with filename as key and label as value.

  • vdata ((dict<str;str>)) – Val dictionary with filename as key and label as value.

Returns:

  • (dict<str;str>) balanced_tdata – Train data without superfluous files to balance it.

  • (dict<str;str>) balanced_vdata – Val data without superfluous files to balance it.
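One way to balance a filename-to-label dictionary is random undersampling down to the smallest class. The sketch below assumes that strategy; the documentation only states that superfluous files are removed, so both the selection rule and the name balance_sketch are assumptions.

```python
import random
from collections import defaultdict


def balance_sketch(data, seed=871):
    """Keep the same number of files per label by random undersampling."""
    by_label = defaultdict(list)
    for filename, label in data.items():
        by_label[label].append(filename)
    if not by_label:
        return {}
    # Every label is cut down to the size of the rarest label.
    smallest = min(len(files) for files in by_label.values())
    rng = random.Random(seed)
    balanced = {}
    for label, files in by_label.items():
        for filename in rng.sample(files, smallest):
            balanced[filename] = label
    return balanced
```

Calling it once on tdata and once on vdata mirrors the two-step behaviour described above.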

load_data_fromdatapath()[source]

Load data files from the data directory, assuming that those files are directly inside it. The filenames are stored as the “files” attribute.

Parameters:

None

Return type:

None

load_groups_fromsheet(sheetpath, idcol, groupcol, clueless_words=None, require_full_filename_match=False)[source]

Load data groups from a sheet file. Files must be loaded before calling this method, for example with “load_data_fromdatapath”. The groups are stored as the “groups” attribute.

Parameters:
  • sheetpath ((str)) – Absolute path to the sheet which contains information about data.

  • idcol ((str)) – Name of the column that contains at least a part of the filename.

  • groupcol ((str)) – Name of the column that contains groups, not used here.

  • clueless_words=None ((array/list like of str)) – Strings considered as None.

  • require_full_filename_match=False ((bool)) – If True, requires the value in the idcol to be exactly the filename; otherwise, a value in the idcol that is included in the filename is considered a match. Note that if the idcol value is included in multiple filenames, it will be associated with the first one found, in descending length order.

Return type:

None

load_labeled_data_fromdatapath()[source]

Load data files and labels from the data directory, assuming that those files are inside subdirectories (named with unique labels). The filenames are stored as the “files” attribute. The labels are stored as the “labels” attribute and their unique values are stored as the “unique_labels” attribute.

Parameters:

None

Return type:

None

load_labels_fromsheet(sheetpath, idcol, labelcol, othercols=None, clueless_words=None, delete_unlabeled_files=True, require_full_filename_match=False)[source]

Load data labels from a sheet file. Files must be loaded before calling this method, for example with “load_data_fromdatapath”. The labels are stored as the “labels” attribute and their unique values are stored as the “unique_labels” attribute.

Parameters:
  • sheetpath ((str)) – Absolute path to the sheet which contains information about data.

  • idcol ((str)) – Name of the column that contains at least a part of the filename.

  • labelcol ((str)) – Name of the column that contains labels, not used here.

  • othercols=None ((array/list like of str)) – Name of the other columns to keep.

  • clueless_words=None ((array/list like of str)) – Strings considered as None.

  • delete_unlabeled_files=True ((bool)) – If True, delete each file without label.

  • require_full_filename_match=False ((bool)) – If True, requires the value in the idcol to be exactly the filename; otherwise, a value in the idcol that is included in the filename is considered a match. Note that if the idcol value is included in multiple filenames, it will be associated with the first one found, in descending length order.

Return type:

None

load_split()[source]

Load a dictionary from a json file.

Parameters:

src ((str)) – absolute path to the json file.

Returns:

loaded data dictionary.

Return type:

(dict<str;str>) data

make_datasets(trainpath, valpath, tdata, vdata, func=None, empty_dir=True, **kwargs)[source]

Run processes on the maximum amount of allowed CPUs to apply the “func” function to each source file. “func” needs “src” and “dstdir” params (in acutils, such functions are prefixed with “tmnt”). **kwargs should be additional arguments to pass to the “func” function.

Parameters:
  • trainpath ((str)) – Absolute path to the destination files train directory.

  • valpath ((str)) – Absolute path to the destination files val directory.

  • tdata ((dict<str;str>)) – Train dictionary with filename as key and label as value.

  • vdata ((dict<str;str>)) – Val dictionary with filename as key and label as value.

  • func=None ((function)) – Treatment that will be applied on each source file. It needs an absolute path to the source file “src” and an absolute path to the destination files directory “dstdir”. In acutils, any function prefixed with “tmnt” is usable.

  • empty_dir=True ((bool)) – If True, reset destination directories and fill them with unique labels as subdirectories if defined.

  • **kwargs – Arguments to pass to the “func” function.

Return type:

None

process(dirpath, func=None, empty_dir=True, **kwargs)[source]

Run processes on the maximum amount of allowed CPUs to apply the “func” function to each source file. If “func” is None, just copy the file. “func” needs “src” and “dstdir” params (in acutils, such functions are prefixed with “tmnt”). **kwargs should be additional arguments to pass to the “func” function.

Parameters:
  • dirpath ((str)) – Absolute path to treated files directory.

  • func=None ((function)) – Treatment that will be applied on each source file. It needs an absolute path to the source file “src” and an absolute path to the destination files directory “dstdir”. In acutils, any function prefixed with “tmnt” is usable.

  • empty_dir=True ((bool)) – If True, reset the destination directory and fill it with unique labels as subdirectories if defined.

  • **kwargs – Arguments to pass to the “func” function.

Return type:

None
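The contract for “func” (a “src” parameter, a “dstdir” parameter, and optional keyword arguments) can be illustrated with a custom treatment. Both names below, tmnt_copy_with_prefix and apply_sequentially_sketch, are hypothetical. The loop is a sequential stand-in for the multi-process dispatch described above and does not reproduce how the work is distributed across CPUs.

```python
import os
import shutil


def tmnt_copy_with_prefix(src, dstdir, prefix="copy_"):
    """Hypothetical treatment: takes "src" and "dstdir" plus one extra keyword."""
    dst = os.path.join(dstdir, prefix + os.path.basename(src))
    shutil.copyfile(src, dst)


def apply_sequentially_sketch(srcs, dstdir, func=None, **kwargs):
    """Apply func to each source file, or plainly copy it when func is None."""
    for src in srcs:
        if func is None:
            # shutil.copy keeps the source filename when dst is a directory.
            shutil.copy(src, dstdir)
        else:
            func(src, dstdir, **kwargs)
```

Extra keyword arguments such as prefix travel through **kwargs, which is the role the documentation gives to **kwargs.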

save_split(dst, data)[source]

Save a split (from “split” method) as a json file.

Parameters:
  • dst ((str)) – absolute path to the new json file.

  • data ((dict<str;str>)) – data dictionary to save.

Return type:

None

split(train_percentage=0.7, balance=False, ignore_groups=False)[source]

Split labeled data into train and val datasets.

Parameters:
  • train_percentage=0.7 ((float)) – Fraction of the data expected in the train dataset (0.7 means 70%).

  • balance=False ((bool)) – If True, call the “balance_datasets” method before returning dictionaries.

  • ignore_groups=False ((bool)) – If True, ignore groups for the split, even if they are defined. If the “groups” attribute is not defined, then it is ignored anyway. If it is defined and “ignore_groups” is False, then the split is done by calling the “_split_using_groups” method.

Returns:

  • (dict<str;str>) tdata – train dictionary with filename as key and label as value.

  • (dict<str;str>) vdata – val dictionary with filename as key and label as value.
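A group-aware split keeps every file of a group on the same side. The sketch below is one possible way to do that and is not the “_split_using_groups” implementation: the function name, the use of the standard random module, and the choice to apply train_percentage to the number of groups are all assumptions.

```python
import random


def split_sketch(files, labels, groups=None, train_percentage=0.7, seed=871):
    """Split files into train/val dictionaries without splitting a group."""
    # Without groups, every file is treated as its own group.
    if groups is None:
        groups = list(files)
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_train = round(len(unique_groups) * train_percentage)
    train_groups = set(unique_groups[:n_train])
    tdata, vdata = {}, {}
    for filename, label, group in zip(files, labels, groups):
        target = tdata if group in train_groups else vdata
        target[filename] = label
    return tdata, vdata
```

Because whole groups are assigned, the share of files in the train dictionary only approximates train_percentage when groups differ in size.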

acutils.image module

acutils.image.tmnt_resize_file(src, dstdir, new_width=224, new_height=224)[source]

Load image file from src, resize, then save it into dstdir.

Parameters:
  • src ((str)) – Absolute path to the file that will be processed.

  • dstdir ((str)) – Absolute path to the directory that should contain new files.

  • new_width=224 ((int)) – Expected width resize.

  • new_height=224 ((int)) – Expected height resize.

Return type:

None

acutils.multiprocess module

acutils.multiprocess.distribute(srcs, dstdirs, allowed_cpus=1, seed=871)[source]

Distribute files to process and split them between allowed CPUs. The distribution is returned as 2 lists of lists of src or dstdir.

Parameters:
  • srcs ((array/list like of str)) – Source files.

  • dstdirs ((array/list like of str)) – Destination directories.

  • allowed_cpus=1 ((int)) – Maximum amount of CPUs used to compute.

  • seed=871 ((int)) – Seed used to initialize numpy randomizer.

Returns:

  • (list<list<str>>) packed_srcs – Source files absolute paths per process.

  • (list<list<str>>) packed_dstdirs – Destination directories absolute paths per process.
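The shape of the returned value, two parallel lists of lists with one inner list per process, can be sketched as follows. The function distribute_sketch is hypothetical: it shuffles with the standard random module rather than numpy and assigns files round-robin, neither of which is confirmed by the documentation.

```python
import random


def distribute_sketch(srcs, dstdirs, allowed_cpus=1, seed=871):
    """Pack (src, dstdir) pairs into one list per allowed CPU."""
    pairs = list(zip(srcs, dstdirs))
    random.Random(seed).shuffle(pairs)
    packed_srcs = [[] for _ in range(allowed_cpus)]
    packed_dstdirs = [[] for _ in range(allowed_cpus)]
    # Round-robin assignment keeps pack sizes within one item of each other.
    for i, (src, dstdir) in enumerate(pairs):
        packed_srcs[i % allowed_cpus].append(src)
        packed_dstdirs[i % allowed_cpus].append(dstdir)
    return packed_srcs, packed_dstdirs
```

Each source stays aligned with its destination directory: packed_srcs[k][j] goes to packed_dstdirs[k][j].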

acutils.multiprocess.run_processes_on_multiple_files(packed_srcs, packed_dstdirs, func, allowed_cpus=1, **kwargs)[source]

Run processes on the maximum amount of allowed CPUs to apply the “func” function to each source file. “func” needs “src” and “dstdir” params (in acutils, such functions are prefixed with “tmnt”). **kwargs should be additional arguments to pass to the “func” function.

Parameters:
  • packed_srcs ((list<list<str>>)) – Source files absolute paths per process.

  • packed_dstdirs ((list<list<str>>)) – Destination directories absolute paths per process.

  • func ((function)) – Treatment that will be applied on each source file. It needs an absolute path to the source file “src” and an absolute path to the destination files directory “dstdir”. In acutils, any function prefixed with “tmnt” is usable.

  • allowed_cpus=1 ((int)) – Maximum amount of CPUs used to compute.

  • **kwargs – Arguments to pass to the “func” function.

Return type:

None

acutils.pathology module

acutils.pathology.browse_segments(slide, bw, lvl=0, required_area=10000, do_harmonize=False)[source]

Find slices inside the preview and load them from the slide.

Parameters:
  • slide ((openslide.OpenSlide)) – The slide.

  • bw ((numpy.array or cupy.array of bool)) – Binary image from the preview.

  • lvl=0 ((int)) – Level taken.

  • required_area=10000 ((int)) – Minimum area (pixels) to keep the slice.

  • do_harmonize=False ((bool)) – Harmonize slices using harmonize function.

Returns:

Found slices.

Return type:

(generator of numpy.array of uint8) Segments

acutils.pathology.browse_tiles(segment, size=512, blank_tol=0.35)[source]

Regularly crop a segment into multiple tiles of the same size.

Parameters:
  • segment ((numpy.array of int)) – Slice to tile.

  • size=512 ((int)) – Size of each tile.

  • blank_tol=0.35 ((float)) – Percentage of white pixels tolerated for a tile.

Returns:

  • (generator of tuple of int) coordinates – Tile coordinates (x, y).

  • (generator of numpy.array of int) tiles – Kept tiles.
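Regular tiling with a blank tolerance can be sketched in pure Python. Both functions are hypothetical and rest on assumptions the documentation does not confirm: partial tiles at the right and bottom edges are dropped, the tile is given as rows of grayscale values, and a pixel counts as white when its value reaches white_level.

```python
def tile_coordinates_sketch(width, height, size=512):
    """Yield the (x, y) top-left corner of each full tile in a regular grid."""
    for y in range(0, height - size + 1, size):
        for x in range(0, width - size + 1, size):
            yield (x, y)


def is_mostly_blank_sketch(tile_rows, blank_tol=0.35, white_level=255):
    """Return True when the share of white pixels exceeds blank_tol."""
    pixels = [p for row in tile_rows for p in row]
    white = sum(1 for p in pixels if p >= white_level)
    return white / len(pixels) > blank_tol
```

A tile would be kept only when is_mostly_blank_sketch returns False for it.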

acutils.pathology.get_cleaned_binary(preview, fp_val=16, sigma=2)[source]

Split foreground (1) and background (0) using mathematic morphology.

Parameters:
  • preview ((numpy.array or cupy.array of int)) – Fully loaded slide with low resolution.

  • fp_val=16 ((int)) – Value that is used to create footprints for segmentation.

  • sigma=2 ((float)) – Value for gaussian filter (applied on the slide before segmentation).

Returns:

Binary image from the preview.

Return type:

(numpy.array or cupy.array of bool) bw

acutils.pathology.get_preview(slide, lvl=-1, divider=None)[source]

Read the slide and return it at low resolution.

Parameters:
  • slide ((openslide.OpenSlide)) – The slide.

  • lvl=-1 ((int)) – Level taken.

  • divider=None ((float)) – Scale the preview by dividing the slide. The new width and height will be the old ones divided by the divider. It is also the size of the area used to load the slide, to ensure no memory overload.

Returns:

Scaled loaded slide.

Return type:

(numpy.array or cupy.array of int) preview

acutils.pathology.harmonize(img)[source]

Harmonize the image through HED conversion, erasing the color and keeping the stains, then match the histogram with a reference image. You can change the reference image by calling the update_ref_image function.

Parameters:

img ((numpy.array or cupy.array of int)) – RGB image to harmonize.

Return type:

None

References

[1] A. C. Ruifrok and D. A. Johnston, “Quantification of histochemical staining by color deconvolution,” Analytical and quantitative cytology and histology / the International Academy of Cytology [and] American Society of Cytology, vol. 23, no. 4, pp. 291-9, Aug. 2001.

acutils.pathology.load_slice_preview_and_ext(src, lvl, divider=None, device=None)[source]

Load slide file and extract a preview of it using get_preview function.

Parameters:
  • src ((str)) – Absolute path to the slide.

  • lvl ((int)) – Slide level taken.

  • divider=None ((float)) – Scale the preview by dividing the slide.

  • device=None ((int)) – Taken GPU.

Returns:

  • (openslide.OpenSlide) slide – The slide.

  • (numpy.array or cupy.array of int) preview – Fully loaded slide with low resolution.

  • (str) slide_ext – Slide extension.

acutils.pathology.mask_rgb(rgb, mask)[source]

Mask rgb image with a binary image.

Parameters:
  • rgb ((numpy.array or cupy.array of int)) – RGB image.

  • mask ((numpy.array or cupy.array of bool)) – Binary image.

Returns:

RGB image masked with the binary image.

Return type:

(numpy.array or cupy.array of uint8) masked_rgb

acutils.pathology.tmnt_harmonize(src, dstdir)[source]

Read a segment, harmonize it calling harmonize function, then save it.

Parameters:
  • src ((str)) – Absolute path to the file that will be processed.

  • dstdir ((str)) – Absolute path to the directory that should contain the new file.

Return type:

None

acutils.pathology.tmnt_save_preview_from_slide(src, dstdir, lvl, divider=None, ext='png', device=None)[source]

Load slide preview and save it.

Parameters:
  • src ((str)) – Absolute path to the slide.

  • dstdir ((str)) – Absolute path to the directory where to save the preview.

  • lvl ((int)) – Slide level taken.

  • divider=None ((float)) – Scale the preview by dividing the slide.

  • ext="png" ((str)) – Saved preview extension.

  • device=None ((int)) – Taken GPU.

Return type:

None

acutils.pathology.tmnt_save_segments_from_slide(src, dstdir, lvlpreview, lvlsegment, fpval, sigma, do_harmonize=False, divider=None, ext='png', device=None)[source]

Find slices inside the slide and save them.

Parameters:
  • src ((str)) – Absolute path to the slide.

  • dstdir ((str)) – Absolute path to the directory where to save the segments.

  • lvlpreview ((int)) – Slide level taken for the preview.

  • lvlsegment ((int)) – Slide level taken for the segment.

  • fpval ((int)) – Value that is used to create footprints for segmentation.

  • sigma ((float)) – Value for gaussian filter (applied on the slide before segmentation).

  • do_harmonize=False ((bool)) – Harmonize slices using harmonize function.

  • divider=None ((float)) – Scale the preview by dividing the slide.

  • ext="png" ((str)) – Saved segments extension.

  • device=None ((int)) – Taken GPU.

Return type:

None

acutils.pathology.tmnt_save_tiles_from_segment(src, dstdir, size=512, blank_tol=0.35, ext='png')[source]

Crop a segment into tiles and save them.

Parameters:
  • src ((str)) – Absolute path to the segment.

  • dstdir ((str)) – Absolute path to the directory where to save the tiles.

  • size=512 ((int)) – Size of each tile.

  • blank_tol=0.35 ((float)) – Percentage of white pixels tolerated for a tile.

  • ext="png" ((str)) – Saved tiles extension.

Return type:

None

acutils.pathology.update_ref_image(img_path=None)[source]

Change the reference image used when calling the harmonize function. If you called gpu.set_gpu_computation(activate=True), calling this function will load the image on the GPU (using cupy instead of numpy).

Parameters:

img_path=None ((str)) – Absolute path to the image that will be the reference to harmonize.

Return type:

None

acutils.sheet module

acutils.sheet.add_suffix_to_cells_from_a_column(df, suffix, columns, inplace=False)[source]

Concatenate a string with the values in certain columns of a Pandas DataFrame.

Parameters:
  • df ((pandas.DataFrame)) – The DataFrame to modify.

  • suffix ((str)) – The string to concatenate with the values in the specified columns.

  • columns ((list)) – A list of column names to concatenate with the string.

  • inplace=False ((bool)) – If True, the DataFrame will be modified in place, else it is returned.

Returns:

Modified copy of the passed df, or None if inplace is True.

Return type:

(pandas.DataFrame) df=None

acutils.sheet.delete_clueless_rows(df, clueless_words=None, columns=None, inplace=False)[source]

Delete rows of a Pandas DataFrame based on the values in clueless_words. Delete each row that has any empty cell or any cell that contains a clueless word. If the columns argument is specified, only those columns are considered.

Parameters:
  • df ((pandas.DataFrame)) – The DataFrame to filter.

  • clueless_words=None ((list<str>)) – Any cell containing a word of the list is treated as empty.

  • columns=None ((list<str>)) – Columns considered for the deletion.

  • inplace=False ((bool)) – If True, the DataFrame will be modified in place, else it is returned.

Returns:

Filtered copy of the passed df, or None if inplace is True.

Return type:

(pandas.DataFrame) df=None

acutils.sheet.read_df_from_any_avalaible_extensions(sheetpath)[source]

Load a Pandas DataFrame from a file. Available extensions: csv, txt, xls, xlsx, feather, parquet, hdf5, sas7bdat, stata, pickle.

Parameters:

sheetpath ((str)) – Absolute path of the file to load.

Returns:

The DataFrame loaded from the file.

Return type:

(pandas.DataFrame) df

Raises:

(ValueError) err: – If the file extension is not supported.

acutils.video module

acutils.video.browse_frames_for_binary_classification(src, bound, extra_down, extra_up, start=0, end=None)[source]

Browse each frame of a video, from “start” to “end”. Skip from bound-extra_down to bound+extra_up.

Parameters:
  • src ((str)) – absolute path to the video

  • bound ((int)) – frame number that switches the status from “before” to “after”

  • extra_down ((int)) – how many frames to skip before the bound

  • extra_up ((int)) – how many frames to skip after the bound

  • start=0 ((int)) – frame number to start browsing frames

  • end=None ((int)) – frame number to end browsing frames (if None, go for the full duration)

Returns:

  • (generator of bool) passed – True from entering inside the skipped range to the end

  • (generator of int) count – frame numbers

  • (generator of numpy.array of uint8) frame – BGR frames of the video
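The skipping rule can be illustrated on frame numbers alone, without decoding a video. The generator below is a hypothetical sketch: it assumes both ends of the skipped range are inclusive and that “end” is exclusive, and it yields only (passed, count) because no frame data is read.

```python
def browse_frame_numbers_sketch(total, bound, extra_down, extra_up, start=0, end=None):
    """Yield (passed, count) for each kept frame number."""
    stop = total if end is None else min(end, total)
    skip_from = bound - extra_down
    skip_to = bound + extra_up
    for count in range(start, stop):
        if skip_from <= count <= skip_to:
            continue  # inside the skipped range around the bound
        # passed is True for every kept frame located after the skipped range.
        yield (count > skip_to, count)
```

With total=10, bound=5, extra_down=1 and extra_up=2, frames 4 to 7 are skipped, frames 0 to 3 are yielded with passed False, and frames 8 and 9 with passed True.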

acutils.video.tmnt_extract_frames_for_binary_classification(src, dstdir, bound, extra_down, extra_up, start=0, end=None, fext='png', before_name='before', after_name='after')[source]

Extract frames of a video, from “start” to “end”. Skip from bound-extra_down to bound+extra_up. Split them into 2 states “before” or “after” the skipped range.

Parameters:
  • src ((str)) – absolute path to the video

  • dstdir ((str)) – absolute path to the directory that should contain the frames

  • bound ((int)) – frame number that switches the status from “before” to “after”

  • extra_down ((int)) – how many frames to skip before the bound

  • extra_up ((int)) – how many frames to skip after the bound

  • start=0 ((int)) – frame number to start browsing frames

  • end=None ((int)) – frame number to end browsing frames (if None, go for the full duration)

  • fext="png" ((str)) – frame extension

  • before_name="before" ((str)) – name of the subdir inside dstdir to save frames before skip

  • after_name="after" ((str)) – name of the subdir inside dstdir to save frames after skip

Return type:

None

acutils.video.tmnt_extract_sequences_for_binary_classification(src, dstdir, bound, extra_down, extra_up, seqlen=6, start=0, end=None, sext='mp4', codec='mp4v', fps=12, before_name='before', after_name='after')[source]

Extract sequences of a video, from “start” to “end”. Skip from bound-extra_down to bound+extra_up. Split them into 2 states “before” or “after” the skipped range.

Parameters:
  • src ((str)) – absolute path to the video

  • dstdir ((str)) – Absolute path to the directory that should contain the frames.

  • bound ((int)) – Frame number that switches the status from “before” to “after”.

  • extra_down ((int)) – How many frames to skip before the bound.

  • extra_up ((int)) – How many frames to skip after the bound.

  • seqlen=6 ((int)) – Amount of frames per sequence.

  • start=0 ((int)) – Frame number to start browsing frames.

  • end=None ((int)) – Frame number to end browsing frames (if None, go for the full duration).

  • sext="mp4" ((str)) – Sequence extension.

  • codec="mp4v" ((str)) – Sequence codec, used for: fourcc = cv2.VideoWriter_fourcc(*codec).

  • fps=12 ((int)) – Frames per second of the saved sequence.

  • before_name="before" ((str)) – Name of the subdir inside dstdir to save frames before skip.

  • after_name="after" ((str)) – Name of the subdir inside dstdir to save frames after skip.
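Grouping kept frames into fixed-length sequences can be sketched on frame numbers. The generator below is hypothetical and makes two assumptions the documentation does not state: a sequence never spans the skipped range, and a trailing sequence shorter than seqlen is dropped.

```python
def group_into_sequences_sketch(frame_numbers, seqlen=6):
    """Yield lists of seqlen consecutive frame numbers."""
    sequence = []
    for count in frame_numbers:
        # A gap in the numbering (the skipped range) starts a new sequence.
        if sequence and count != sequence[-1] + 1:
            sequence = []
        sequence.append(count)
        if len(sequence) == seqlen:
            yield sequence
            sequence = []
```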

Module contents