FlowIO Tutorial
FlowIO is a Python library for reading and writing flow cytometry standard (FCS) files. It is intended as a lightweight library, suitable for parsing FCS data sets (e.g. as a web server backend, for simple metadata extraction, etc.). It is highly recommended that one be familiar with the various FCS file standards (2.0, 3.0, 3.1) before using FlowIO for downstream analysis. For higher level cytometry analysis, please see the related FlowKit library, which offers a much wider set of analysis options such as compensation, transformation, and gating support (including support for importing FlowJo 10 workspaces).
If you have any questions about FlowIO, find any bugs, or feel something is missing from the documentation, please submit an issue to the FlowIO GitHub repository.
[1]:
import flowio
[2]:
flowio.__version__
[2]:
'1.4.0'
A Primer on FCS File Sections
Before getting into the details of the FlowIO API, let’s go over some nomenclature used by the FCS specification for defining the various sections found in an FCS file. We’ll use the FCS 3.1 specification here, as this version is most commonly encountered these days. While there are differences between the FCS versions, these basic section definitions are generally the same across versions.
The FCS specification lists the following file sections and references their names using all capital letters. The FlowIO documentation uses these same conventions (all caps) when referencing these sections.
| Segment Name | Description |
|---|---|
| HEADER | Identifies the FCS version and describes the byte locations for the other segments in the data set. |
| TEXT | Contains a series of ASCII encoded keyword-value pairs that describe various aspects of the data set (a.k.a. metadata). |
| DATA | Contains the raw event data in one of three modes (list, correlated or uncorrelated) described by the $MODE keyword. |
| ANALYSIS | An optional segment that, when present, contains the results of data processing. The ANALYSIS segment has the same structure as the TEXT segment; i.e., it consists of a series of keyword-value pairs. There are no required keywords for the ANALYSIS segment. |
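To make the role of the HEADER segment concrete, here is a minimal standard-library sketch that parses a synthetic HEADER. It assumes the FCS 3.x fixed-width layout (a 6-character version string, 4 spaces, then six right-justified 8-character ASCII byte offsets for the TEXT, DATA, and ANALYSIS segments). The function name `parse_header` and the sample offsets are hypothetical, and this is not FlowIO's own implementation.

```python
def parse_header(header: bytes) -> dict:
    """Parse the fixed-width fields of a 58-byte FCS HEADER (sketch)."""
    version = header[0:6].decode("ascii")  # e.g. 'FCS3.1'
    names = [
        "text_start", "text_end",
        "data_start", "data_end",
        "analysis_start", "analysis_end",
    ]
    result = {"version": version}
    for i, name in enumerate(names):
        # each offset occupies 8 bytes, starting at byte 10
        field = header[10 + 8 * i: 18 + 8 * i].decode("ascii").strip()
        # assumption: a blank field is treated like 0 (segment absent)
        result[name] = int(field) if field else 0
    return result


# synthetic HEADER with made-up offsets (not taken from a real file)
sample_header = b"FCS3.1" + b" " * 4 + b"".join(
    str(n).rjust(8).encode("ascii") for n in (58, 1024, 1025, 214896, 0, 0)
)
header_info = parse_header(sample_header)
print(header_info)
```

In this sketch, an ANALYSIS start and end of 0 stands for a data set without an ANALYSIS segment.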
Notes on FCS Metadata
As mentioned above, the TEXT segment contains metadata stored as keyword-value pairs. Keywords defined by the FCS specification begin with the $ character, and only FCS-defined keywords are allowed to begin with $. Keyword names may be written in any mixture of case, and FCS readers are instructed to ignore case when matching them. Some FCS-defined keywords are required while others are optional.
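As an illustration of the keyword-value structure, the following sketch splits a small, made-up TEXT string into pairs and separates FCS-defined keywords from custom ones. It assumes the usual FCS convention that the first character of the TEXT segment is the delimiter, and it deliberately ignores escaped (doubled) delimiters inside values. The name `parse_text_segment` and the sample string are hypothetical; FlowIO's real parser handles more cases.

```python
def parse_text_segment(text: str) -> dict:
    """Split a TEXT segment string into a keyword -> value dict (sketch)."""
    delimiter = text[0]  # the first character defines the delimiter
    body = text[1:]
    if body.endswith(delimiter):
        body = body[:-1]  # drop the trailing delimiter
    tokens = body.split(delimiter)
    # tokens alternate: keyword, value, keyword, value, ...
    return dict(zip(tokens[0::2], tokens[1::2]))


# made-up TEXT content using '/' as the delimiter
sample_text = "/$TOT/13367/$PAR/8/$P1N/FSC-H/$CYT/FACSCalibur/SAMPLE ID/Default Patient ID/"
pairs = parse_text_segment(sample_text)

# only FCS-defined keywords may start with '$'
fcs_defined = {k: v for k, v in pairs.items() if k.startswith("$")}
custom = {k: v for k, v in pairs.items() if not k.startswith("$")}
print(fcs_defined)
print(custom)
```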
FlowData Class
A FlowData instance represents a single FCS file and is created from a local file name, file path, filehandle, or a pathlib Path object. FlowIO currently supports reading FCS 2.0, 3.0, and 3.1 files.
Let’s take a look at the FlowData constructor method:
FlowData(
    filename_or_handle,
    ignore_offset_error=False,
    ignore_offset_discrepancy=False,
    use_header_offsets=False,
    only_text=False,
    nextdata_offset=None,
    null_channel_list=None,
)
filename_or_handle: a path string or a file handle for an FCS file
ignore_offset_error: option to ignore data offset error (see note below), default is False
ignore_offset_discrepancy: option to ignore discrepancy between the HEADER and TEXT values for the DATA byte offset location, default is False
use_header_offsets: use the HEADER section for the data offset locations, default is False. Setting this option to True also suppresses an error in cases of an offset discrepancy.
only_text: option to only read the “text” segment of the FCS file without loading event data, default is False
nextdata_offset: an integer indicating the byte offset for a data set, used for reading a data set from an FCS file containing multiple data sets
null_channel_list: list of PnN labels corresponding to null channels
Note about FCS files with a data offset error:
Some FCS files incorrectly report the location of the last data byte as the last byte exclusive of the data section rather than the last byte inclusive of the data section. In short, the reported location of the last byte is off by one byte. Technically, these are invalid FCS files but are not corrupted data files. To attempt to read in these files, set the ignore_offset_error option to True.
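The off-by-one situation can be sketched with plain arithmetic. The example below compares the size implied by a pair of DATA offsets against the size expected from the event count, channel count, and bit width. It assumes every channel uses the same bit width, and the function name `check_data_offsets` and the offsets passed to it are hypothetical; it is not how FlowIO itself detects the problem.

```python
def check_data_offsets(data_start: int, data_end: int, event_count: int,
                       channel_count: int, bits_per_value: int) -> str:
    """Classify reported DATA offsets against the expected data size (sketch)."""
    expected_size = event_count * channel_count * bits_per_value // 8
    # a valid end offset is inclusive, so the size is end - start + 1
    reported_size = data_end - data_start + 1
    if reported_size == expected_size:
        return "ok"
    if reported_size == expected_size + 1:
        return "off-by-one"  # end offset points one byte past the data
    return "mismatch"


# 13367 events x 8 channels x 16 bits = 213872 bytes of event data;
# the start offset of 2560 is a made-up value for illustration
valid = check_data_offsets(2560, 216431, 13367, 8, 16)
shifted = check_data_offsets(2560, 216432, 13367, 8, 16)
print(valid, shifted)
```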
Note on ignore_offset_discrepancy and use_header_offsets: The byte offset location for the DATA segment is defined in 2 places in an FCS file: the HEADER and the TEXT segments. By default, FlowIO uses the offset values found in the TEXT segment. If the HEADER values differ from the TEXT values, a DataOffsetDiscrepancyError will be raised. Setting ignore_offset_discrepancy to True suppresses this error and forces loading of the FCS file. The related use_header_offsets option forces loading the file using the data offset locations found in the HEADER section rather than the TEXT section. Setting use_header_offsets to True is equivalent to setting both options to True, meaning no error will be raised for an offset discrepancy.
Create a FlowData Instance
Let’s create a FlowData instance from an FCS file path string.
[3]:
fcs_path = '../../data/fcs_files/data1.fcs'
[4]:
fd = flowio.FlowData(fcs_path)
[5]:
fd
[5]:
FlowData(data1.fcs)
Metadata and Channel Information
Get the FCS version of the file:
[6]:
fd.version
[6]:
'2.0'
All the keyword-value pairs in the TEXT segment are available via the text attribute. A few of these values are also available as dedicated attributes for convenience; we’ll get to those in a bit.
NOTE: The FlowData class stores TEXT keywords in lowercase regardless of how they were stored in the FCS file. Additionally, any FCS-defined keywords are stripped of their $ character prefix. This is intentionally done for more convenient lookup so the user doesn’t have to remember which keywords are FCS-defined or worry about the case.
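The normalization described in this note can be approximated in a couple of lines. This is only a sketch of the idea (lowercase the keyword, then drop a leading `$`); the helper name `normalize_keyword` is hypothetical and FlowIO's internal handling may differ in details.

```python
def normalize_keyword(keyword: str) -> str:
    """Lowercase a TEXT keyword and drop a leading '$' (sketch)."""
    key = keyword.lower()
    return key[1:] if key.startswith("$") else key


# made-up raw keywords as they might appear in a TEXT segment
raw_pairs = {"$P1N": "FSC-H", "$Tot": "13367", "Sample ID": "Default Patient ID"}
normalized = {normalize_keyword(k): v for k, v in raw_pairs.items()}
print(normalized)
```

With keys normalized this way, a lookup such as `normalized["p1n"]` works regardless of how the keyword was capitalized in the file.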
[7]:
fd.text
[7]:
{'byteord': '4,3,2,1',
'datatype': 'I',
'nextdata': '0',
'sys': 'Macintosh System Software 9.0.4',
'creator': 'CELLQuestª 3.3',
'tot': '13367',
'mode': 'L',
'par': '8',
'p1n': 'FSC-H',
'p1r': '1024',
'p1b': '16',
'p1e': '0,0',
'p1g': '3.67',
'p2n': 'SSC-H',
'p2r': '1024',
'p2b': '16',
'p2e': '0,0',
'p2g': '8',
'p3n': 'FL1-H',
'p3r': '1024',
'p3b': '16',
'p3e': '4,0',
'p4n': 'FL2-H',
'p4r': '1024',
'p4b': '16',
'p4e': '4,0',
'p5n': 'FL3-H',
'p5r': '1024',
'p5b': '16',
'p5e': '4,0',
'p1s': 'FSC-Height',
'p2s': 'SSC-Height',
'p3s': 'CD4 FITC',
'p4s': 'CD8 B PE',
'p5s': 'CD3 PerCP',
'p6n': 'FL2-A',
'p6r': '1024',
'p6b': '16',
'p6e': '0,0',
'timeticks': '100',
'p7n': 'FL4-H',
'p7r': '1024',
'p7e': '4,0',
'p7b': '16',
'p7s': 'CD8 APC',
'p8n': 'Time',
'p8r': '1024',
'p8e': '0,0',
'p8b': '16',
'p8s': 'Time (102.40 sec.)',
'sample id': 'Default Patient ID',
'src': 'Default',
'case number': 'Default Case Number',
'cyt': 'FACSCalibur',
'cytnum': 'E3820',
'btim': '16:31:33',
'etim': '16:31:52',
'bdacqlibversion': '3.1',
'bdnpar': '7',
'bdp1n': 'FSC-H',
'bdp2n': 'SSC-H',
'bdp3n': 'FL1-H',
'bdp4n': 'FL2-H',
'bdp5n': 'FL3-H',
'bdp6n': 'FL2-A',
'bdp7n': 'FL4-H',
'bdword0': '24',
'bdword1': '394',
'bdword2': '492',
'bdword3': '477',
'bdword4': '566',
'bdword5': '397',
'bdword6': '397',
'bdword7': '397',
'bdword8': '398',
'bdword9': '397',
'bdword10': '300',
'bdword11': '299',
'bdword12': '551',
'bdword13': '4',
'bdword14': '397',
'bdword15': '501',
'bdword16': '481',
'bdword17': '586',
'bdword18': '574',
'bdword19': '100',
'bdword20': '100',
'bdword21': '100',
'bdword22': '100',
'bdword23': '1',
'bdword24': '1',
'bdword25': '0',
'bdword26': '0',
'bdword27': '0',
'bdword28': '136',
'bdword29': '52',
'bdword30': '52',
'bdword31': '52',
'bdword32': '52',
'bdword33': '52',
'bdword34': '12',
'bdword35': '201',
'bdword36': '6',
'bdword37': '138',
'bdword38': '280',
'bdword39': '3',
'bdword40': '3',
'bdword41': '100',
'bdword42': '100',
'bdword43': '0',
'bdword44': '1023',
'bdword45': '1023',
'bdword46': '1023',
'bdword47': '53',
'bdword48': '550',
'bdword49': '56',
'bdword50': '72',
'bdword51': '52',
'bdword52': '0',
'bdword53': '0',
'bdword54': '0',
'bdword55': '0',
'bdword56': '0',
'bdword57': '0',
'bdword58': '0',
'bdword59': '0',
'bdword60': '0',
'bdword61': '0',
'bdword62': '0',
'bdword63': '0',
'bdlasermode': '1',
'calibfile': 'FALSE',
'p7thresvol': '52',
'fil': 'B07',
'date': '23-Aug-02',
'number well info keywords': '3',
'&1sample': '200',
'&2number of washes': '1',
'&3mixing vol': '100',
'&4number of mixes': '2',
'&5data file prefix part #1\\\\&6data file prefix part #2\\\\&7data file prefix part #3\\\\&8acquisition doc.': 'LYMPH SUBSET ACQ',
'&9instr. sett. file': 'E#7 Settings #1',
'&10patient id': ' FJ#192659',
'&11day': '35d',
'&12sample id': 'T-cells',
'&13analysis doc.': ''}
As mentioned above, certain metadata in the TEXT segment is available in other FlowData attributes. Most of these relate to event and channel metadata. For example, event_count gives the number of events in the FCS file:
[8]:
fd.event_count
[8]:
13367
The data type for the event data is available via the data_type attribute, which stores the single character data type code. The four allowed values are ‘I’ for unsigned binary integer, ‘F’ for single precision IEEE floating point, ‘D’ for double precision IEEE floating point, or ‘A’ for ASCII.
[9]:
fd.data_type
[9]:
'I'
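To show how a data type code relates to the raw bytes in the DATA segment, here is a hedged sketch that decodes fixed-width values with the standard `struct` module. It assumes all values share one bit width and that the byte order keyword is either '1,2,3,4' (little-endian) or '4,3,2,1' (big-endian). The mapping table and the name `decode_values` are illustrative, not part of FlowIO, and the ASCII type ‘A’ is not covered.

```python
import struct

# struct format characters for a few (data type, bit width) combinations
STRUCT_CODES = {
    ("I", 8): "B", ("I", 16): "H", ("I", 32): "I", ("I", 64): "Q",
    ("F", 32): "f", ("D", 64): "d",
}


def decode_values(raw: bytes, data_type: str, bits: int, byteord: str) -> list:
    """Decode a run of fixed-width values from raw bytes (sketch)."""
    if byteord == "1,2,3,4":
        endian = "<"  # little-endian
    elif byteord == "4,3,2,1":
        endian = ">"  # big-endian
    else:
        raise ValueError(f"unsupported byte order: {byteord}")
    code = STRUCT_CODES[(data_type, bits)]
    count = len(raw) // (bits // 8)
    return list(struct.unpack(f"{endian}{count}{code}", raw))


# four 16-bit unsigned integers packed big-endian, as in an 'I' type file
raw_bytes = struct.pack(">4H", 323, 218, 220, 394)
values = decode_values(raw_bytes, "I", 16, "4,3,2,1")
print(values)
```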
The FCS file size (in bytes):
[10]:
fd.file_size
[10]:
216432
The number of channels of event data:
[11]:
fd.channel_count
[11]:
8
Channel Metadata
FCS defines several keyword sets for channel metadata. These numbered parameter keywords begin with the letter ‘P’ followed by a channel number and a third character denoting the type of channel metadata. For example, keywords of the form ‘PnN’ correspond to the required parameter names, where ‘n’ is a channel number (e.g. ‘P1N’ for the first channel’s name). We’ll go over several of these channel metadata sets.
The ‘PnN’ channel labels are found in the pnn_labels attribute. The order of these labels corresponds to the channel order found in the event data.
[12]:
fd.pnn_labels
[12]:
['FSC-H', 'SSC-H', 'FL1-H', 'FL2-H', 'FL3-H', 'FL2-A', 'FL4-H', 'Time']
The ‘PnS’ labels are not required by FCS, but often contain useful channel information. The pns_labels attribute is guaranteed to have a matching length to that of the required pnn_labels list. Any missing PnS fields will contain an empty string. This is useful for retrieving channel metadata for the same channel index.
[13]:
fd.pns_labels
[13]:
['FSC-Height',
'SSC-Height',
'CD4 FITC',
'CD8 B PE',
'CD3 PerCP',
'',
'CD8 APC',
'Time (102.40 sec.)']
Additionally, there are the ‘PnR’ values containing the data range for each channel.
[14]:
fd.pnr_values
[14]:
[1024.0, 1024.0, 1024.0, 1024.0, 1024.0, 1024.0, 1024.0, 1024.0]
All the channel metadata needed for correct interpretation of the raw event data is summarized in the channels attribute, stored as a dictionary. The keys are channel numbers (not channel indices) and the values are dictionaries whose keys are the channel metadata class types: ‘pnn’, ‘pns’, ‘pne’, ‘png’, and ‘pnr’.
[15]:
fd.channels
[15]:
{1: {'pnn': 'FSC-H',
'pns': 'FSC-Height',
'pne': (0.0, 0.0),
'png': 3.67,
'pnr': 1024.0},
2: {'pnn': 'SSC-H',
'pns': 'SSC-Height',
'pne': (0.0, 0.0),
'png': 8.0,
'pnr': 1024.0},
3: {'pnn': 'FL1-H',
'pns': 'CD4 FITC',
'pne': (4.0, 1.0),
'png': 1.0,
'pnr': 1024.0},
4: {'pnn': 'FL2-H',
'pns': 'CD8 B PE',
'pne': (4.0, 1.0),
'png': 1.0,
'pnr': 1024.0},
5: {'pnn': 'FL3-H',
'pns': 'CD3 PerCP',
'pne': (4.0, 1.0),
'png': 1.0,
'pnr': 1024.0},
6: {'pnn': 'FL2-A', 'pns': '', 'pne': (0.0, 0.0), 'png': 1.0, 'pnr': 1024.0},
7: {'pnn': 'FL4-H',
'pns': 'CD8 APC',
'pne': (4.0, 1.0),
'png': 1.0,
'pnr': 1024.0},
8: {'pnn': 'Time',
'pns': 'Time (102.40 sec.)',
'pne': (0.0, 0.0),
'png': 1.0,
'pnr': 1024.0}}
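The relationship between the flat text dictionary and the per-channel summary above can be sketched as follows. This is an assumption-laden illustration, not FlowIO's code: the helper name `build_channels` is hypothetical, missing PnS, PnE, and PnG keywords fall back to '', (0, 0), and 1.0, and a logarithmic PnE with a zero second value is replaced by 1.0, chosen here to match the `pne` values in the output above. The sample dictionary is truncated to three channels.

```python
def build_channels(text: dict) -> dict:
    """Collect per-channel metadata keyed by channel number (sketch)."""
    channels = {}
    for n in range(1, int(text["par"]) + 1):
        decades, offset = (float(v) for v in text.get(f"p{n}e", "0,0").split(","))
        if decades > 0 and offset == 0:
            offset = 1.0  # assumption: treat a log PnE of 'f1,0' as 'f1,1'
        channels[n] = {
            "pnn": text[f"p{n}n"],          # required channel name
            "pns": text.get(f"p{n}s", ""),  # optional label
            "pne": (decades, offset),
            "png": float(text.get(f"p{n}g", 1.0)),
            "pnr": float(text[f"p{n}r"]),
        }
    return channels


# truncated sample: only the first 3 channels of the metadata shown earlier
sample_text = {
    "par": "3",
    "p1n": "FSC-H", "p1r": "1024", "p1e": "0,0", "p1g": "3.67", "p1s": "FSC-Height",
    "p2n": "SSC-H", "p2r": "1024", "p2e": "0,0", "p2g": "8", "p2s": "SSC-Height",
    "p3n": "FL1-H", "p3r": "1024", "p3e": "4,0", "p3s": "CD4 FITC",
}
channels = build_channels(sample_text)
print(channels[3])
```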
Finally, we have a few attributes to serve as helpers for distinguishing common parameter types found in cytometry data (scatter channels, fluorescence channels, and the time channel). These attributes are scatter_indices, fluoro_indices, and time_index. Note, these are indices (zero-indexed) and not channel numbers.
[16]:
fd.scatter_indices
[16]:
[0, 1]
[17]:
fd.fluoro_indices
[17]:
[2, 3, 4, 5, 6]
[18]:
fd.time_index
[18]:
7
Event Data
The FlowData class stores event data in the same unprocessed list mode as found in the FCS file. This is done intentionally to minimize the memory usage of FlowData instances. In general, this unprocessed data is not suitable for downstream analysis because preprocessing steps are needed for proper interpretation of the channel data. The processed data is available as a 2-D NumPy array via the as_array method.
Get unprocessed event data as 1-D array from the events attribute:
[19]:
# only selecting the first few for demonstration
fd.events[:10]
[19]:
array('H', [323, 218, 220, 394, 267, 5, 183, 0, 70, 43])
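Because list mode stores events one after another, the 1-D sequence can be regrouped into rows of `channel_count` values. The following standard-library sketch shows the idea. The helper name `to_rows` is hypothetical, the first event reuses the values shown above, and the second event is made up for illustration.

```python
from array import array


def to_rows(events, channel_count: int) -> list:
    """Group a flat list-mode sequence into one row per event (sketch)."""
    if len(events) % channel_count:
        raise ValueError("event data length is not a multiple of channel_count")
    return [
        list(events[i:i + channel_count])
        for i in range(0, len(events), channel_count)
    ]


flat_events = array("H", [
    323, 218, 220, 394, 267, 5, 183, 0,  # first event (8 channels)
    10, 20, 30, 40, 50, 60, 70, 80,      # second event: made-up values
])
rows = to_rows(flat_events, 8)
print(rows[0])
```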
Get the processed data as a 2-D NumPy array using the as_array method. First, let’s read the docstring.
[20]:
help(fd.as_array)
Help on method as_array in module flowio.flowdata:
as_array(preprocess=True) method of flowio.flowdata.FlowData instance
Retrieve the event data list as a 2-D NumPy array. Pre-processing is
applied if requested and includes applying gain, log, and time scaling
as necessary.
:param preprocess: Boolean for whether to apply gain, log, and time
scaling as necessary according the FCS metadata (default is True).
:return: NumPy array of 2-D event data
[21]:
# by default, it returns the preprocessed data
fd.as_array()
[21]:
array([[ 88.01089918, 27.25 , 7.23394163, ..., 5. ,
5.18613419, 0. ],
[ 19.07356948, 5.375 , 36.51741273, ..., 0. ,
4.29351021, 0. ],
[ 70.57220708, 26. , 2.48045441, ..., 0. ,
8.58210354, 0. ],
...,
[ 62.1253406 , 27.625 , 11.75743266, ..., 0. ,
1.77827941, 174. ],
[ 36.23978202, 64.5 , 5.42469094, ..., 0. ,
4.95806824, 174. ],
[ 66.48501362, 8.75 , 1.43301257, ..., 0. ,
6.0429639 , 174. ]])
[22]:
# set 'preprocess=False' to get a 2-D NumPy array of unprocessed data
fd.as_array(preprocess=False)
[22]:
array([[323., 218., 220., ..., 5., 183., 0.],
[ 70., 43., 400., ..., 0., 162., 0.],
[259., 208., 101., ..., 0., 239., 0.],
...,
[228., 221., 274., ..., 0., 64., 174.],
[133., 516., 188., ..., 0., 178., 174.],
[244., 70., 40., ..., 0., 200., 174.]])
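The difference between the two arrays above can be reproduced by hand for a single event. The sketch below applies gain scaling to linear channels (raw value divided by PnG) and log scaling to logarithmic channels (f2 × 10^(f1 × raw / PnR), using the PnE pair (f1, f2)), with the channel metadata shown earlier. The function name `preprocess_value` is hypothetical, time scaling is left out, and this is a simplified illustration rather than FlowIO's implementation.

```python
import math


def preprocess_value(raw: float, pne: tuple, png: float, pnr: float) -> float:
    """Apply log or gain scaling to one raw channel value (sketch)."""
    decades, offset = pne
    if decades > 0:
        # logarithmic channel: map the raw value onto `decades` log decades
        return offset * 10 ** (decades * raw / pnr)
    # linear channel: divide by the gain
    return raw / png


# first event of the example file, channels 1-3
fsc = preprocess_value(323, (0.0, 0.0), 3.67, 1024.0)
ssc = preprocess_value(218, (0.0, 0.0), 8.0, 1024.0)
fl1 = preprocess_value(220, (4.0, 1.0), 1.0, 1024.0)
print(fsc, ssc, fl1)
```

These three values agree with the first three columns of the first row returned by `fd.as_array()` above.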
Export as FCS
The FlowData class can also export the instance as a new FCS file using the write_fcs method. This is useful for modifying or removing certain metadata. Note, FlowIO only exports FCS files with $DATATYPE ‘F’ (single precision floating point). If non-floating point data was loaded, the event data will be preprocessed and stored as ‘F’.
[23]:
help(fd.write_fcs)
Help on method write_fcs in module flowio.flowdata:
write_fcs(filename, metadata=None) method of flowio.flowdata.FlowData instance
Export FlowData instance as a new FCS file.
By default, the output FCS file will include the $cyt, $date, and $spill
keywords (and values) from the FlowData instance. To exclude these keys,
specify a custom `metadata` dictionary (including an empty dictionary for
the bare minimum metadata). Note: Any critical keywords related to the
interpretation of the event data are defined and set internally,
overriding those in the provided `metadata` dictionary. These keywords
include: PnB, PnE, and PnG.
:param filename: name of exported FCS file
:param metadata: an optional dictionary for adding metadata keywords/values
:return: None
Other FlowIO Features
The FlowData class is the main feature of FlowIO, but there are a few other useful features.
List of FCS Keywords
As mentioned in the section on FCS metadata, some keywords are predefined in the FCS specification. FlowIO includes lookup lists of these reserved keywords in the fcs_keywords module: one for all the reserved keywords, one for just the required keywords, and one for just the optional keywords.
[24]:
flowio.fcs_keywords.FCS_STANDARD_KEYWORDS
[24]:
['beginanalysis',
'begindata',
'beginstext',
'byteord',
'datatype',
'endanalysis',
'enddata',
'endstext',
'mode',
'nextdata',
'par',
'tot',
'abrt',
'btim',
'cells',
'com',
'csmode',
'csvbits',
'cyt',
'cytsn',
'date',
'etim',
'exp',
'fil',
'gate',
'inst',
'last_modified',
'last_modifier',
'lost',
'op',
'originality',
'plateid',
'platename',
'proj',
'smno',
'spillover',
'src',
'sys',
'timestep',
'tr',
'vol',
'wellid']
[25]:
flowio.fcs_keywords.FCS_STANDARD_REQUIRED_KEYWORDS
[25]:
['beginanalysis',
'begindata',
'beginstext',
'byteord',
'datatype',
'endanalysis',
'enddata',
'endstext',
'mode',
'nextdata',
'par',
'tot']
[26]:
flowio.fcs_keywords.FCS_STANDARD_OPTIONAL_KEYWORDS
[26]:
['abrt',
'btim',
'cells',
'com',
'csmode',
'csvbits',
'cyt',
'cytsn',
'date',
'etim',
'exp',
'fil',
'gate',
'inst',
'last_modified',
'last_modifier',
'lost',
'op',
'originality',
'plateid',
'platename',
'proj',
'smno',
'spillover',
'src',
'sys',
'timestep',
'tr',
'vol',
'wellid']
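One possible use of these lists is a quick check for missing required keywords in a metadata dictionary. The sketch below copies the required keyword list from the output above so that it runs without FlowIO installed; the helper name `missing_required` and the sample dictionary are hypothetical.

```python
# copied from the FCS_STANDARD_REQUIRED_KEYWORDS output above
REQUIRED_KEYWORDS = [
    "beginanalysis", "begindata", "beginstext", "byteord", "datatype",
    "endanalysis", "enddata", "endstext", "mode", "nextdata", "par", "tot",
]


def missing_required(text: dict) -> list:
    """Return the required keywords absent from a normalized text dict (sketch)."""
    return [key for key in REQUIRED_KEYWORDS if key not in text]


# made-up, partial metadata using lowercase keys without the '$' prefix
sample_text = {
    "byteord": "4,3,2,1", "datatype": "I", "mode": "L",
    "nextdata": "0", "par": "8", "tot": "13367",
}
missing = missing_required(sample_text)
print(missing)
```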
Reading FCS Files with Multiple Datasets
Some FCS files contain multiple data sets within the same file. FlowIO supports reading in these files via the standalone read_multiple_data_sets function which returns a list of FlowData instances. Let’s review the docstring and then use the function to extract the data sets from an example file.
[27]:
help(flowio.read_multiple_data_sets)
Help on function read_multiple_data_sets in module flowio.utils:
read_multiple_data_sets(filename_or_handle, ignore_offset_error=False, ignore_offset_discrepancy=False, use_header_offsets=False, only_text=False)
Utility function for reading all data sets contained in an FCS file.
:param filename_or_handle: a path string or a file handle for an FCS file
:param ignore_offset_error: option to ignore data offset error (see above note), default is False
:param ignore_offset_discrepancy: option to ignore discrepancy between the HEADER
and TEXT values for the DATA byte offset location, default is False
:param use_header_offsets: use the HEADER section for the data offset locations, default is False.
Setting this option to True also suppresses an error in cases of an offset discrepancy.
:param only_text: option to only read the "text" segment of the FCS file without loading event data,
default is False
:return: List of FlowData instances for each found data set
The following example file has the “off by one” issue of incorrectly reporting the last byte location. We must set ignore_offset_error=True to open the file without throwing an error. Note, FlowIO will still emit a UserWarning indicating that the file should be reviewed.
[28]:
fd_list = flowio.read_multiple_data_sets("../../data/fcs_files/coulter.lmd", ignore_offset_error=True)
/home/swhite@dhe.duke.edu/git/flowio/src/flowio/flowdata.py:450: UserWarning: FCS file coulter.lmd reported incorrect data offset. Attempting to parse data section, but event data should be reviewed before trusting this file.
warn(warn_msg)
[29]:
len(fd_list)
[29]:
2
[30]:
fd_list
[30]:
[FlowData(coulter.lmd), FlowData(coulter.lmd)]
Creating FCS Files from Numerical Arrays
The standalone create_fcs function allows for the creation of a new FCS file from numerical arrays. This can be useful for creating FCS files for test cases, saving processed events, or saving a subset of extracted event data. Let’s review the docstring and see an example of creating an FCS file from a 2-D array of randomly generated data.
[31]:
help(flowio.create_fcs)
Help on function create_fcs in module flowio.create_fcs:
create_fcs(file_handle, event_data, channel_names, opt_channel_names=None, metadata_dict=None)
Create a new FCS file from a list of event data.
Note:
A proper spillover matrix shall have the first value corresponding to the
number of compensated fluorescence channels followed by the $PnN names
which should match the given channel_names argument. All values in the
spill text string should be comma-delimited with no newline characters.
:param file_handle: file handle for new FCS file
:param event_data: list of event data (flattened 1-D list)
:param channel_names: list of channel labels to use for PnN fields
:param opt_channel_names: optional list of channel labels to use for PnS fields
:param metadata_dict: an optional dictionary for adding extra metadata keywords/values
:return:
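The spillover format described in the docstring note (channel count, then channel names, then the matrix values, all comma-delimited) can be assembled as in this sketch. The helper name `build_spillover_text` and the matrix values are made up for illustration. The metadata keyword under which create_fcs expects this string is not shown in the docstring, so it is left out here.

```python
def build_spillover_text(channel_names: list, matrix: list) -> str:
    """Build a comma-delimited spillover string from a square matrix (sketch)."""
    n = len(channel_names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the channel names")
    # matrix values are written row by row
    values = [str(v) for row in matrix for v in row]
    return ",".join([str(n), *channel_names, *values])


# hypothetical 2-channel spillover matrix
spill_text = build_spillover_text(
    ["channel_A", "channel_B"],
    [[1.0, 0.02],
     [0.05, 1.0]],
)
print(spill_text)
```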
Generate a synthetic data set of 2 separated clusters in 4 dimensions.
Note, we flatten the 2-D array to a list for input to the create_fcs function.
[32]:
import numpy as np
[33]:
# these clusters are clearly separated, each containing 2000 points
cluster1 = np.random.multivariate_normal(
[6000.0, 6000.0, 0.0, 3000.0],
np.array(
[
[600000, 300, 0, 0],
[300, 1000, 0, 0],
[0, 0, 1, 10],
[0, 0, 10, 1000]
]
),
(2000,)
)
cluster2 = np.random.multivariate_normal(
[-10.0, 0.0, 0.0, 0.0],
np.array(
[
[10000, 100, 0, 0],
[100, 10000, 0, 0],
[0, 0, 100000, 0],
[0, 0, 0, 1000]
]
),
(2000,)
)
data_set_points = np.vstack(
[
cluster1,
cluster2,
]
).flatten().tolist()
[34]:
# create required labels for the 4 channels
channel_names = [
'channel_A',
'channel_B',
'channel_C',
'channel_D'
]
# open a file handle and save the data to a new FCS file;
# the context manager closes the file even if create_fcs raises
with open('create_fcs_example.fcs', 'wb') as fh:
    flowio.create_fcs(fh, data_set_points, channel_names)
Open the newly created FCS file using the FlowData class:
[35]:
fd_we_created = flowio.FlowData('create_fcs_example.fcs')
[36]:
fd_we_created.channel_count
[36]:
4
[37]:
fd_we_created.event_count
[37]:
4000
[38]:
fd_we_created.channels
[38]:
{1: {'pnn': 'channel_A',
'pns': '',
'pne': (0.0, 0.0),
'png': 1.0,
'pnr': 262144.0},
2: {'pnn': 'channel_B',
'pns': '',
'pne': (0.0, 0.0),
'png': 1.0,
'pnr': 262144.0},
3: {'pnn': 'channel_C',
'pns': '',
'pne': (0.0, 0.0),
'png': 1.0,
'pnr': 262144.0},
4: {'pnn': 'channel_D',
'pns': '',
'pne': (0.0, 0.0),
'png': 1.0,
'pnr': 262144.0}}
Custom Exceptions
FlowIO includes a few custom exception and warning classes, useful for catching FlowIO-specific errors. All FlowIO-defined warnings derive from the generic FlowIOWarning class. All FlowIO-defined exceptions derive from the FlowIOException class.
PnEWarning
Warning for invalid PnE values when creating FCS files
FCSParsingError
Errors relating to parsing an FCS file
DataOffsetDiscrepancyError
Raised when an FCS file’s HEADER and TEXT sections provide different byte offsets for the DATA section.
MultipleDataSetsError
Raised for errors related to FCS files containing more than one dataset, indicated by the ‘nextdata’ keyword.