ingest

A mechanism to ingest CSV files into a database.

In morphological profiling experiments, a CellProfiler pipeline is often run in parallel across multiple images and produces a set of CSV files. For example, imaging a 384-well plate with 9 sites per well produces 384 * 9 = 3456 images. A CellProfiler process may be run on each image, resulting in 384 * 9 output directories. Each directory typically contains one CSV file per compartment (e.g. Cells.csv, Cytoplasm.csv, Nuclei.csv) and one CSV file for per-image measurements (e.g. Image.csv).

cytominer_database.ingest.seed can be used to read all these CSV files into a database backend. SQLite is the recommended engine, but ingest will likely also work with PostgreSQL and MySQL.

cytominer_database.ingest.seed assumes a directory structure like the one shown below:

plate_a/
    set_1/
        file_1.csv
        file_2.csv
        file_n.csv
    set_2/
        file_1.csv
        file_2.csv
        file_n.csv
    set_m/
        file_1.csv
        file_2.csv
        file_n.csv
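To make the expected layout concrete, the sketch below walks a plate directory and lists the CSV files found in each subdirectory. It is only an illustration of the structure described above, not part of cytominer_database; the helper names find_csv_sets and make_example_plate are hypothetical.

```python
from pathlib import Path


def find_csv_sets(source):
    """Map each subdirectory name of `source` to its sorted CSV file names."""
    sets = {}
    for directory in sorted(Path(source).iterdir()):
        if directory.is_dir():
            sets[directory.name] = sorted(p.name for p in directory.glob("*.csv"))
    return sets


def make_example_plate(root):
    """Create a tiny plate_a/set_*/ layout under `root` for experimentation."""
    plate = Path(root) / "plate_a"
    for set_name in ("set_1", "set_2"):
        directory = plate / set_name
        directory.mkdir(parents=True)
        for file_name in ("Cells.csv", "Image.csv"):
            # Placeholder content: a header row and a single data row.
            (directory / file_name).write_text("ImageNumber\n1\n")
    return plate
```

Calling find_csv_sets on a plate directory before ingesting is a cheap way to check that every subdirectory contains the files you expect.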

Example:

import cytominer_database.ingest

cytominer_database.ingest.seed(source, target, config_file)
cytominer_database.ingest.checksum(pathname, buffer_size=65536)

Generate a 32-bit unique identifier for a file.

Parameters:
  • pathname – Input file.
  • buffer_size – Buffer size used when reading the file.
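
One way to compute a 32-bit identifier while reading a file in buffer_size chunks is a streaming CRC-32. The sketch below assumes CRC-32 as the algorithm, which the description above does not state; the function name crc32_of_file is hypothetical. Note that any 32-bit value can collide for different files, so it identifies a file only with high probability.

```python
import zlib


def crc32_of_file(pathname, buffer_size=65536):
    """Return the CRC-32 of a file as an unsigned 32-bit integer."""
    result = 0
    with open(pathname, "rb") as stream:
        while True:
            # Read the file in fixed-size chunks to keep memory use bounded.
            chunk = stream.read(buffer_size)
            if not chunk:
                break
            # Passing the running value continues the checksum across chunks.
            result = zlib.crc32(chunk, result)
    return result & 0xFFFFFFFF
```

The result does not depend on buffer_size; a larger buffer only reduces the number of read calls.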
cytominer_database.ingest.into(input, output, name, identifier, skip_table_prefix=False)

Ingest a CSV file into a table in a database.

Parameters:
  • input – Input CSV file.
  • output – Connection string for the database.
  • name – Table in the database into which the CSV file will be ingested.
  • identifier – Unique identifier for input.
  • skip_table_prefix – True if the prefix of the table name should be excluded from the names of columns.
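
The effect of name, identifier and skip_table_prefix can be illustrated with the standard library alone. The sketch below is not the cytominer_database implementation: it assumes that prefixing means joining the table name and column name with an underscore, and it stores the identifier in a column called TableNumber. Both choices, and the helper names prefixed_columns and ingest_csv, are assumptions made for illustration.

```python
import csv
import sqlite3


def prefixed_columns(header, table_name, skip_table_prefix=False):
    """Prefix each column name with the table name unless skipping is requested."""
    if skip_table_prefix:
        return list(header)
    return ["{}_{}".format(table_name, column) for column in header]


def ingest_csv(connection, csv_path, table_name, identifier, skip_table_prefix=False):
    """Append the rows of one CSV file to a table, tagging each row with `identifier`."""
    with open(csv_path, newline="") as stream:
        reader = csv.reader(stream)
        header = next(reader)
        # "TableNumber" is an assumed name for the column holding the identifier.
        columns = ["TableNumber"] + prefixed_columns(header, table_name, skip_table_prefix)
        quoted = ", ".join('"{}"'.format(column) for column in columns)
        placeholders = ", ".join("?" for _ in columns)
        connection.execute(
            'CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table_name, quoted)
        )
        connection.executemany(
            'INSERT INTO "{}" ({}) VALUES ({})'.format(table_name, quoted, placeholders),
            ([identifier] + row for row in reader),
        )
    connection.commit()
```

With this scheme, a column AreaShape_Area read from Cells.csv becomes Cells_AreaShape_Area, which keeps column names distinct when compartment tables are later joined. Values are inserted as text because the sketch does no type inference.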
cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix=True)

Read CSV files into a database backend.

Parameters:
  • source – Directory containing subdirectories that contain CSV files.
  • target – Connection string for the database.
  • config_file – Configuration file.
  • skip_image_prefix – True if the prefix of the image table name should be excluded from the names of columns in the per-image table.