1 Settings which may be necessary in emacs

I have been messing with my emacs setup to make this project easier.

(defvar elpy-rpc-virtualenv-path 'current)
(pyvenv-activate "/sw/local/fiji/202211")
(pyvenv-activate "venv")

2 Introduction

Jacques is seeking to segment and follow cells over time.

Steps performed so far:

  1. Python/venv installation of fiji
  2. Figured out some easy interactions between python and actual image data.
  3. Implemented a simple function to find minimum distances between putative cells using the extant methods.

3 Next steps

  1. Note that the czi images are immediately openable via fiji’s bioformat interface.
  2. Load dataset via fiji
  3. Invoke cellpose via the python interface (same as pyimagej)
  4. Save ROIs produced by cellpose, save them to zip output file
  1. The roi interface in fiji can address these
  1. Currently a macro is invoked which performs a set of measurements on every ROI and saves them to a csv. This creates the primary data structure used for all following processes.
  2. Need to create a datastructure which identifies each individual cell over time given these ROIs which x/y/time(frame) along with the measurements of interest (area/mean/stdev/intensities by color).
    1. Small postprocessing details: the intensity values produced must be normalized by cell area.

4 Demonstrate the first couple of steps using an actual dataset

Jacques sent me an image acquired from the microscope and I saved it as ‘test_data/raw.tif’, he also sent me a cellpose model which I saved to the ‘models/’ directory.

Given these as a starting point, let us try to open the image with a fiji instance and split it into a series of smaller files by timepoint.

4.1 Load necessary python modules

import os
import imagej
import pandas
import numpy as np
import geopandas
import matplotlib.pyplot as plt
import seaborn
import scyjava
import multiprocessing as mp

4.2 Set up some initial variables

base_dir = '/home/trey/scratch/jacques'
os.chdir(base_dir)
input_file = f"{base_dir}/test_data/raw.tif"
output_dir = f"{base_dir}/outputs"

4.3 Start imagej/fiji

Note that I am using fiji/imagej from within a virtual environment which I symbolically linked to the local directory ‘venv/’.

scyjava.config.add_option('-Xmx64g')
ij = imagej.init('venv/bin/Fiji.app', mode = 'interactive')
ij.getVersion()
## '2.9.0/1.53t'
os.chdir(base_dir)

4.4 Open the input file and split it by time

Open the file, figure out its dimensions, and write portions of it to an output directory.

I wrote this thinking I could parallelize the output writing to 8 cpus. But I think I do not yet understand how python scopes variables in this context and so it did not quite work. I turned that off for the moment and ran it and it finished in about a minute.

def separate_slices(input_file, output_base = 'outputs', wanted_x = True, wanted_y = True,
                    wanted_z = 1, wanted_channel = 2, cpus = 8):
    pool = mp.Pool(cpus)
    input_base = os.path.basename(input_file)
    input_dir = os.path.dirname(input_file)
    input_name = os.path.splitext(input_base)[0]
    output_directory = f"{input_dir}/outputs/{input_name}_substack"
    os.makedirs(output_directory, exist_ok = True)
    print("Starting to open the input file, this takes a moment.")
    dataset = ij.io().open(input_file)
    print(f"Opened input file, writing images to {output_directory}")

    data_info = {}
    for element in range(len(dataset.dims)):
        name = dataset.dims[element]
        data_info[name] = dataset.shape[element]
    print(f"This dataset has dimensions: X:{data_info['X']} Y:{data_info['Y']} Z:{data_info['Z']} Time:{data_info['Time']}")

    def save_slice(timepoint):
        wanted_slice = dataset[:, :, wanted_channel, wanted_z, timepoint]
        slice_image = ij.py.to_dataset(wanted_slice)
        output_filename = f"{output_directory}/{input_name}_{timepoint}.tif"
        saved = ij.io().save(slice_image, output_filename)
        message = f"Saving image {input_name}_{timepoint}."
        print(message)

    for timepoint in range(data_info['Time']):
        saved = save_slice(timepoint)
        ## I want to see if I can split the saves across 8 cpus.
        ## pool.apply(save_slice, args = (timepoint, ))
    ## pool.close()
    return dataset

raw_split = separate_slices(input_file)

4.5 Create

5 Make an initial implementation of following

  1. Split the table into a list of dataframes which include a column containing the minimum distance and ID of the mindist element in the next/previous element of the list.
start <- openxlsx::readWorkbook("distance_calculation/exp0329scene5.xlsx")
readr::write_tsv(x=start, file="distance_calculation/testing.tsv")

5.1 Absolute value distances among cells

Lets make a quick implementation of an absolute distance of each cell over time in the hopes that I can better understand the question.

positions <- start[, c("X", "Y")]
colnames(positions) <- c("x", "y")
window <- ripras(positions)
cell_positions <- as.ppp(positions, W = window)
## Warning: data contain duplicated points
marks(cell_positions) <- start
unitname(cell_positions) <- "micrometer"

## Get every pairwise distance between two cells.
all_distances <- nndist(cell_positions, by = "ID")
## Error: The name 'ID' does not match any column of marks
## Get the cell ID which in the nearest neighbor in every frame for every cell.
closest_by_time <- nnwhich(cell_positions, by = "Frame")
## Note that the first column is a special case
closest_by_time[, 1] <- 1:nrow(closest_by_time)
## Track the first cell over time
cell1_closest <- closest_by_time[1, ]
cell1_closest
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    1   19   39   58   78   97  117  137  156  176  195  214  234  254  274  294 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  314  334  354  372  391  411  430  451  471  489  508  528  545  564  584  604 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  625  647  668  690  712  734  756  780  799  821  843  865  887  910  929  950 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##  971  988 1011 1030 1049 1068 1086 1104 1119 1137 1156 1175 1194 1213 1231 1249 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
## 1267 1285 1303 1320 1339 1359 1379 1399 1415 1432 1451 1469 1489 1506 1523 1540 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
## 1556 1574 1590 1607 1621 1636 1651 1667 1682 1697 1713 1728 1743 1758 1773 1788 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
## 1803 1818 1834 1850 1866 1881 1897 1912 1928 1941 1958 1975 1990 2007 2024 2041 
##  113  114  115  116  117  118  119  120  121 
## 2058 2075 2092 2109 2126 2144 2163 2181 2199
cell2_closest <- closest_by_time[2, ]
cell2_closest
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    2   20   42   62   82  101  121  141  160  179  198  218  238  258  278  298 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  318  338  356  374  394  413  432  454  473  492  512  531  548  568  588  609 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  629  652  672  694  717  739  761  781  803  825  847  870  892  913  932  950 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##  971  991 1011 1030 1049 1068 1086 1104 1122 1140 1159 1179 1197 1214 1235 1253 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
## 1271 1288 1306 1323 1342 1363 1384 1401 1418 1436 1455 1473 1489 1506 1523 1540 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
## 1556 1574 1590 1607 1621 1636 1651 1667 1682 1697 1713 1728 1743 1758 1773 1788 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
## 1803 1818 1834 1850 1866 1881 1897 1912 1928 1944 1961 1978 1994 2011 2028 2045 
##  113  114  115  116  117  118  119  120  121 
## 2062 2079 2096 2113 2130 2148 2167 2185 2203
cell5_closest <- closest_by_time[5, ]
cell5_closest
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    5   24   43   63   83  102  122  142  161  180  199  219  239  259  279  299 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  319  339  358  376  396  416  435  456  476  494  514  530  549  569  589  611 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  632  654  676  698  719  741  763  783  806  828  850  873  894  914  934  954 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##  974  995 1015 1034 1053 1072 1090 1107 1126 1144 1162 1181 1201 1219 1237 1255 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
## 1273 1291 1308 1325 1344 1365 1385 1403 1420 1439 1457 1476 1493 1510 1527 1544 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
## 1560 1577 1594 1610 1625 1640 1655 1670 1685 1700 1716 1731 1746 1761 1776 1790 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
## 1806 1820 1836 1852 1868 1884 1900 1915 1931 1947 1963 1980 1996 2013 2030 2047 
##  113  114  115  116  117  118  119  120  121 
## 2064 2081 2098 2115 2132 2150 2169 2187 2204
## Each number of this vector is the ID of the cell over time.
cell1_areas <- marks(cell_positions)[cell1_closest, "Area"]

cell1_positions <- cell_positions[cell1_closest, ]

5.2 Now iterate over each pair of time points 1:2, 2:3, 3:4, … 120:121

Now lets add a little logic to extract just the cells for timepoints a,b

times <- unique(marks(cell_positions)[["Frame"]])
for (t in 2:max(times)) {
  first_idx <- marks(cell_positions)[["Frame"]] == t - 1
  second_idx <- marks(cell_positions)[["Frame"]] == t
  both_idx <- first_idx | second_idx
  subset <- cell_positions[both_idx, ]
  closest_pair <- nnwhich(subset, by = "Frame")
}

Ok, so I think I have a little bit of the idea now, let us try and implement something closer to what Jacques actually wants in python.

6 Loading necessary python libraries

I am writing this document in R markdown and using the ipython REPL to run the code because my confidence in my python fluency is not very high (I still kind of think in python2 sometimes because that is what I learned first) and I make stupid mistakes pretty regularly; so getting that immediate feedback that my datastructures are wrong is nice.

Unfortunately, the interaction between how I use rmarkdown and python and ipython is not perfect. Thus in the following python blocks you may see a stupid 1+1 statement at the beginning, this is a very weird work around to keep the statements I am sending from going haywire and sending my text notes to python.

7 Load the TSV data

In the above R implementation, I dumped the data to a tsv file, and will just read that here for simplicity.

While I am testing, I will extract only the columns containing the ID, x, y, and time(Frame).

file_location = os.path.join('distance_calculation', 'testing.tsv')
input = pandas.read_table(file_location)
input.columns.values[0] = "ID"
coords = input[["ID", "X", "Y", "Frame", "Area"]]

8 Load the coordinates into a geospatial distance method

geopandas looks like it handles coordinate data in a similar fashion to R’s spatstat.geom. I am not sure if it will be able to handle a z-stack; but I assume it will prove easy to extend the implementation to points_from_xyz().

gdf = geopandas.GeoDataFrame(
    coords,
    geometry = geopandas.points_from_xy(coords.X, coords.Y))

I think the easiest way to extract minimum distances in python is to split this df into pairs by Frame followed by using the sjoin_nearest() function to find nearest points from a->b.

8.1 subset the gdf data

In the following block, I will iterate over every time from 1 to end-1 and extract the elements of the gdf from those two time points. Then I will use the sjoin_nearest() to get me every time-pairwise closest distance as a new dataframe.

These pairwise distance dataframes will be appended to the eponymous pairwise_distances array; which upon completion will comprise n-1 elements where the first element includes the results for timepoints 1,2 and the last element includes n-1,n.

tt = 1 + 2

final_time = gdf.Frame.max()
pairwise_distances = []
for start_time in range(1, final_time):
    i = start_time
    j = i + 1
    ti_idx = gdf.Frame == i
    tj_idx = gdf.Frame == j
    print("Getting distances of dfs ", i, " and ", j, ".")
    ti = gdf[ti_idx]
    tj = gdf[tj_idx]
    ti_rows = ti.shape[0]
    tj_rows = tj.shape[0]
    titj = geopandas.sjoin_nearest(ti, tj, distance_col = "pairwise_dist")
    pairwise_distances.append(titj)
## Getting distances of dfs  1  and  2 .
## Getting distances of dfs  2  and  3 .
## Getting distances of dfs  3  and  4 .
## Getting distances of dfs  4  and  5 .
## Getting distances of dfs  5  and  6 .
## Getting distances of dfs  6  and  7 .
## Getting distances of dfs  7  and  8 .
## Getting distances of dfs  8  and  9 .
## Getting distances of dfs  9  and  10 .
## Getting distances of dfs  10  and  11 .
## Getting distances of dfs  11  and  12 .
## Getting distances of dfs  12  and  13 .
## Getting distances of dfs  13  and  14 .
## Getting distances of dfs  14  and  15 .
## Getting distances of dfs  15  and  16 .
## Getting distances of dfs  16  and  17 .
## Getting distances of dfs  17  and  18 .
## Getting distances of dfs  18  and  19 .
## Getting distances of dfs  19  and  20 .
## Getting distances of dfs  20  and  21 .
## Getting distances of dfs  21  and  22 .
## Getting distances of dfs  22  and  23 .
## Getting distances of dfs  23  and  24 .
## Getting distances of dfs  24  and  25 .
## Getting distances of dfs  25  and  26 .
## Getting distances of dfs  26  and  27 .
## Getting distances of dfs  27  and  28 .
## Getting distances of dfs  28  and  29 .
## Getting distances of dfs  29  and  30 .
## Getting distances of dfs  30  and  31 .
## Getting distances of dfs  31  and  32 .
## Getting distances of dfs  32  and  33 .
## Getting distances of dfs  33  and  34 .
## Getting distances of dfs  34  and  35 .
## Getting distances of dfs  35  and  36 .
## Getting distances of dfs  36  and  37 .
## Getting distances of dfs  37  and  38 .
## Getting distances of dfs  38  and  39 .
## Getting distances of dfs  39  and  40 .
## Getting distances of dfs  40  and  41 .
## Getting distances of dfs  41  and  42 .
## Getting distances of dfs  42  and  43 .
## Getting distances of dfs  43  and  44 .
## Getting distances of dfs  44  and  45 .
## Getting distances of dfs  45  and  46 .
## Getting distances of dfs  46  and  47 .
## Getting distances of dfs  47  and  48 .
## Getting distances of dfs  48  and  49 .
## Getting distances of dfs  49  and  50 .
## Getting distances of dfs  50  and  51 .
## Getting distances of dfs  51  and  52 .
## Getting distances of dfs  52  and  53 .
## Getting distances of dfs  53  and  54 .
## Getting distances of dfs  54  and  55 .
## Getting distances of dfs  55  and  56 .
## Getting distances of dfs  56  and  57 .
## Getting distances of dfs  57  and  58 .
## Getting distances of dfs  58  and  59 .
## Getting distances of dfs  59  and  60 .
## Getting distances of dfs  60  and  61 .
## Getting distances of dfs  61  and  62 .
## Getting distances of dfs  62  and  63 .
## Getting distances of dfs  63  and  64 .
## Getting distances of dfs  64  and  65 .
## Getting distances of dfs  65  and  66 .
## Getting distances of dfs  66  and  67 .
## Getting distances of dfs  67  and  68 .
## Getting distances of dfs  68  and  69 .
## Getting distances of dfs  69  and  70 .
## Getting distances of dfs  70  and  71 .
## Getting distances of dfs  71  and  72 .
## Getting distances of dfs  72  and  73 .
## Getting distances of dfs  73  and  74 .
## Getting distances of dfs  74  and  75 .
## Getting distances of dfs  75  and  76 .
## Getting distances of dfs  76  and  77 .
## Getting distances of dfs  77  and  78 .
## Getting distances of dfs  78  and  79 .
## Getting distances of dfs  79  and  80 .
## Getting distances of dfs  80  and  81 .
## Getting distances of dfs  81  and  82 .
## Getting distances of dfs  82  and  83 .
## Getting distances of dfs  83  and  84 .
## Getting distances of dfs  84  and  85 .
## Getting distances of dfs  85  and  86 .
## Getting distances of dfs  86  and  87 .
## Getting distances of dfs  87  and  88 .
## Getting distances of dfs  88  and  89 .
## Getting distances of dfs  89  and  90 .
## Getting distances of dfs  90  and  91 .
## Getting distances of dfs  91  and  92 .
## Getting distances of dfs  92  and  93 .
## Getting distances of dfs  93  and  94 .
## Getting distances of dfs  94  and  95 .
## Getting distances of dfs  95  and  96 .
## Getting distances of dfs  96  and  97 .
## Getting distances of dfs  97  and  98 .
## Getting distances of dfs  98  and  99 .
## Getting distances of dfs  99  and  100 .
## Getting distances of dfs  100  and  101 .
## Getting distances of dfs  101  and  102 .
## Getting distances of dfs  102  and  103 .
## Getting distances of dfs  103  and  104 .
## Getting distances of dfs  104  and  105 .
## Getting distances of dfs  105  and  106 .
## Getting distances of dfs  106  and  107 .
## Getting distances of dfs  107  and  108 .
## Getting distances of dfs  108  and  109 .
## Getting distances of dfs  109  and  110 .
## Getting distances of dfs  110  and  111 .
## Getting distances of dfs  111  and  112 .
## Getting distances of dfs  112  and  113 .
## Getting distances of dfs  113  and  114 .
## Getting distances of dfs  114  and  115 .
## Getting distances of dfs  115  and  116 .
## Getting distances of dfs  116  and  117 .
## Getting distances of dfs  117  and  118 .
## Getting distances of dfs  118  and  119 .
## Getting distances of dfs  119  and  120 .
## Getting distances of dfs  120  and  121 .

8.2 Filter the pairwise distances and find neighbors over time

Given the pairwise_distances array, I should be able to iterate over it and make a series of arrays containing the cell IDs which are composed of cells which are closest to each other over time.

The primary caveat here is that we know that sometimes the cells are merged or other shenanigans happen which cause the closest distance to get relatively large, thus I will filter out any pairwise distances which are greater than max_dist.

As a result, the number of cells observed over time will by definition not be the ~ 20 observed in the 121 timepoints, but will be some larger number comprised of the sum over time. A likely next step will be to merge these partial groups post-facto, but not right now, I just want to get this initial implementation working.

Here is the logic:

  1. Use id_counter to make a global cell ID across all timepoints.
  2. Use traced{} to associate the id_counter cell ID to an array of observed cells over time.
  3. Use ends{} to associate every known current endpoint with a id_counter cell ID so that we can keep track of who is who.
  4. Then iterate over every element of pairwise_distances and do the following:
  1. drop elements with distances > max_dist.
  2. check to see if the early-timepoint nearest neighbor (start_cell) has been seen in ends{}: * If so add end_cell to it and to the appropriate element of traced{}. * If not, initialize traced{} with the start/end and add it to ends{}.

Upon completion I should have a dictionary of arrays containing all of the contiugous nearest-neighbor cells that stayed within max_dist of each other.

## Start over if the distance between two cells is bigger than 10 units.
max_dist = 10.0
## Start over if the proportion between the bigger and smaller cell area has changed
## by more than a factor of 0.25.
max_prop = 0.7
id_counter = 0
## Cell IDs pointing to a list of cells
traced = {}
## Endpoints pointing to the cell IDs
ends = {}
for i in range(0, final_time - 1):
    query = pairwise_distances[i]

    passed_idx = query.pairwise_dist <= max_dist
    failed_idx = query.pairwise_dist > max_dist
    if (failed_idx.sum() > 0):
        print("Skipped ", failed_idx.sum(), " elements in segment ", i,
              " because distance is too large.")
        query = query[passed_idx]

    prop_change = query.Area_left / query.Area_right
    increased_idx = prop_change > 1.0
    prop_change[increased_idx] = 1.0 / prop_change[increased_idx]
    failed_idx = prop_change < max_prop
    passed_idx = prop_change >= max_prop
    if (failed_idx.sum() > 0):
        print("Skipped ", failed_idx.sum(), " elements in segment ", i,
              " because the size changed too much.")
        query = query[passed_idx]

    for row in query.itertuples():
        start_cell = row.ID_left
        end_cell = row.ID_right
        if start_cell in ends.keys():
            cell_id = ends[start_cell]
            current_value = traced[cell_id]
            current_value.append(end_cell)
            traced[cell_id] = current_value
            ends[end_cell] = cell_id
        else:
            id_counter = id_counter + 1
            traced[id_counter] = [start_cell, end_cell]
            ends[end_cell] = id_counter
## Skipped  2  elements in segment  1  because the size changed too much.
## Skipped  1  elements in segment  2  because the size changed too much.
## Skipped  2  elements in segment  3  because the size changed too much.
## Skipped  3  elements in segment  4  because the size changed too much.
## Skipped  2  elements in segment  6  because the size changed too much.
## Skipped  1  elements in segment  7  because the size changed too much.
## Skipped  1  elements in segment  8  because distance is too large.
## Skipped  1  elements in segment  8  because the size changed too much.
## Skipped  2  elements in segment  10  because the size changed too much.
## Skipped  1  elements in segment  11  because the size changed too much.
## Skipped  1  elements in segment  15  because the size changed too much.
## Skipped  1  elements in segment  16  because the size changed too much.
## Skipped  4  elements in segment  17  because the size changed too much.
## Skipped  3  elements in segment  18  because the size changed too much.
## Skipped  1  elements in segment  20  because distance is too large.
## Skipped  1  elements in segment  20  because the size changed too much.
## Skipped  2  elements in segment  23  because distance is too large.
## Skipped  2  elements in segment  23  because the size changed too much.
## Skipped  1  elements in segment  25  because distance is too large.
## Skipped  2  elements in segment  25  because the size changed too much.
## Skipped  2  elements in segment  26  because distance is too large.
## Skipped  1  elements in segment  27  because distance is too large.
## Skipped  2  elements in segment  32  because distance is too large.
## Skipped  1  elements in segment  33  because distance is too large.
## Skipped  1  elements in segment  34  because distance is too large.
## Skipped  1  elements in segment  34  because the size changed too much.
## Skipped  1  elements in segment  36  because distance is too large.
## Skipped  1  elements in segment  37  because the size changed too much.
## Skipped  4  elements in segment  38  because distance is too large.
## Skipped  1  elements in segment  38  because the size changed too much.
## Skipped  1  elements in segment  39  because distance is too large.
## Skipped  1  elements in segment  41  because distance is too large.
## Skipped  1  elements in segment  42  because distance is too large.
## Skipped  1  elements in segment  42  because the size changed too much.
## Skipped  1  elements in segment  43  because distance is too large.
## Skipped  1  elements in segment  43  because the size changed too much.
## Skipped  1  elements in segment  44  because the size changed too much.
## Skipped  2  elements in segment  45  because the size changed too much.
## Skipped  1  elements in segment  46  because distance is too large.
## Skipped  1  elements in segment  46  because the size changed too much.
## Skipped  1  elements in segment  47  because distance is too large.
## Skipped  1  elements in segment  47  because the size changed too much.
## Skipped  2  elements in segment  48  because the size changed too much.
## Skipped  1  elements in segment  49  because distance is too large.
## Skipped  2  elements in segment  52  because the size changed too much.
## Skipped  1  elements in segment  54  because the size changed too much.
## Skipped  2  elements in segment  55  because the size changed too much.
## Skipped  1  elements in segment  58  because distance is too large.
## Skipped  1  elements in segment  58  because the size changed too much.
## Skipped  1  elements in segment  59  because the size changed too much.
## Skipped  1  elements in segment  60  because the size changed too much.
## Skipped  1  elements in segment  61  because the size changed too much.
## Skipped  1  elements in segment  62  because the size changed too much.
## Skipped  1  elements in segment  65  because distance is too large.
## Skipped  1  elements in segment  66  because the size changed too much.
## Skipped  1  elements in segment  67  because the size changed too much.
## Skipped  1  elements in segment  68  because distance is too large.
## Skipped  1  elements in segment  69  because the size changed too much.
## Skipped  2  elements in segment  70  because distance is too large.
## Skipped  2  elements in segment  70  because the size changed too much.
## Skipped  1  elements in segment  71  because distance is too large.
## Skipped  1  elements in segment  71  because the size changed too much.
## Skipped  1  elements in segment  72  because the size changed too much.
## Skipped  1  elements in segment  73  because distance is too large.
## Skipped  1  elements in segment  75  because distance is too large.
## Skipped  1  elements in segment  78  because the size changed too much.
## Skipped  1  elements in segment  82  because distance is too large.
## Skipped  1  elements in segment  83  because distance is too large.
## Skipped  1  elements in segment  90  because distance is too large.
## Skipped  1  elements in segment  98  because the size changed too much.
## Skipped  1  elements in segment  99  because distance is too large.
## Skipped  1  elements in segment  101  because distance is too large.
## Skipped  1  elements in segment  101  because the size changed too much.
## Skipped  1  elements in segment  103  because distance is too large.
## Skipped  1  elements in segment  105  because the size changed too much.
## Skipped  2  elements in segment  106  because the size changed too much.
## Skipped  1  elements in segment  107  because the size changed too much.
## Skipped  1  elements in segment  115  because the size changed too much.
## Skipped  4  elements in segment  118  because the size changed too much.
traced
## {1: [1, 19], 2: [2, 20], 3: [3, 21, 41, 61, 81, 100, 120, 139, 158, 177, 196, 217, 237, 256, 277, 297, 316, 336, 355, 373, 393, 412, 431, 452, 472, 490, 510, 529, 546, 566, 586, 606, 627, 650, 670], 4: [4, 22, 40, 60, 80, 99, 119, 140, 159, 178, 197, 216, 236, 257, 276, 296, 317, 337, 357, 375, 395, 414, 433, 453, 474, 491, 511, 530, 547, 567, 587, 607, 628, 651, 671, 693, 715, 737, 759, 780, 802, 824, 846, 868, 890, 911, 931, 951, 970, 990, 1010, 1029, 1048, 1067, 1085, 1103, 1121, 1139, 1155, 1177, 1193, 1212, 1230, 1248, 1266, 1284, 1302, 1319, 1338, 1358, 1378, 1398, 1415, 1432, 1451, 1469, 1487, 1504, 1521, 1538, 1554, 1570, 1587], 5: [5, 24, 43, 63], 6: [6, 25, 44, 64, 84, 103, 123, 143, 162, 181, 200, 221, 240, 260, 280, 300, 320, 340, 359, 377, 397, 415, 434, 455, 475, 493, 513, 532, 550, 570, 590, 610, 630, 653, 673, 695, 716, 738, 760, 782, 804, 826, 848, 871, 891, 912, 933, 953, 972, 992, 1012, 1031, 1050, 1069, 1087, 1105, 1123, 1141, 1160, 1178, 1198, 1216, 1234, 1252, 1270, 1289, 1305, 1322, 1341, 1362, 1382, 1400, 1416, 1433, 1452, 1470, 1488, 1505, 1522, 1539, 1555, 1571, 1588, 1604, 1620, 1635, 1650, 1665, 1680, 1695, 1711, 1726, 1741, 1756, 1771, 1786, 1801, 1816, 1832, 1848, 1864, 1879, 1895, 1911, 1926, 1942, 1959, 1976, 1992, 2009, 2026, 2043, 2060, 2077, 2094, 2111, 2128, 2146, 2165, 2184, 2202], 7: [7, 26, 45, 65, 85, 105, 124, 144, 164, 182, 201, 222, 242, 262, 282, 302, 322, 342, 360], 8: [8, 27, 47, 67, 86, 106, 126, 146, 165, 184, 203, 223, 243, 263, 283, 303, 323, 343, 362, 380, 399, 418, 437, 458, 477, 496, 516, 533, 553, 573, 593, 614, 637, 658, 680, 702, 724, 746, 768, 786, 809, 831, 855, 876, 897, 918, 937, 957, 977, 998, 1017, 1037, 1055, 1074, 1092, 1110, 1128, 1146, 1165, 1184, 1203, 1221, 1241, 1258, 1276, 1294, 1311, 1328, 1348, 1369, 1389, 1407, 1423, 1441, 1459, 1477, 1494, 1511, 1528, 1545, 1561, 1578, 1595, 1611, 1626, 1641, 1656, 1671, 1686, 1701, 1717, 1732, 1747, 1762, 1777, 1792, 1807, 1822, 1838, 1854, 1870, 1885, 1901, 1916, 1932, 1948, 1965, 1982, 1998, 2015, 2032, 2049, 2066, 2083, 2100, 2117, 2134, 2152, 2170, 2189, 2206], 9: [9, 28, 46, 66], 10: [10, 29, 49, 69, 88], 11: [11, 30, 48, 68, 87, 107, 127, 147, 166, 185, 204, 224, 244, 264, 284, 304, 324, 344, 363, 383, 403, 422, 442, 463, 481, 501, 520, 537, 556, 576, 596, 617, 639, 660, 682, 704, 726, 748, 770, 790, 812, 834, 857, 878, 899, 920, 940, 960, 980, 1000, 1020, 1039, 1058, 1077, 1095, 1113, 1131, 1149, 1168, 1187, 1206, 1224, 1242, 1260, 1278, 1296, 1313, 1330, 1350, 1371, 1391, 1409, 1425, 1443, 1462, 1480, 1497, 1514, 1531, 1548, 1565, 1582, 1599, 1615, 1630, 1645, 1660, 1675, 1690, 1705, 1721, 1736, 1751, 1766, 1781, 1796, 1811, 1827, 1843, 1858, 1873, 1889, 1904, 1920, 1935, 1952, 1969, 1985, 2002, 2019, 2036, 2053, 2070, 2087, 2104, 2121, 2138, 2156, 2175, 2192, 2210], 12: [12, 31, 50, 70, 89, 109, 129], 13: [13, 32, 52, 72, 90, 111, 131, 150, 170, 189, 208, 228, 248, 268, 288, 308, 328, 349], 14: [14, 33, 51, 71, 91, 110, 130], 15: [15, 35, 54, 74, 93, 113, 133, 152, 172, 191, 210, 230, 250, 270, 290, 310], 16: [16, 34, 53, 73, 92, 112, 132, 151, 171, 190, 209, 229, 249, 269, 289, 309, 329, 348, 367, 385, 405, 425, 444, 466, 483, 503, 521, 539, 558, 578, 598, 619, 641, 662, 684, 706, 728, 750, 772, 792, 814, 836, 859, 880, 901, 922, 942, 962, 982, 1002, 1022, 1041, 1060], 17: [17, 36, 55, 75, 94, 114, 134, 153, 173, 192, 211, 231, 251, 271, 291, 311, 331, 351, 369, 388, 408, 427, 447, 468, 486, 505, 524, 542, 561, 581, 601, 622, 644, 665, 687, 709, 731, 753, 776, 796, 818, 840, 862, 884, 905, 926, 946, 966, 985, 1005, 1025, 1044, 1063, 1081, 1099, 1117, 1135, 1153, 1172, 1191, 1211, 1229, 1247, 1265, 1283, 1301, 1318, 1336, 1357, 1377, 1397, 1414, 1431, 1449, 1467, 1486, 1503, 1520, 1537, 1553, 1569, 1586, 1603, 1619, 1634, 1649, 1664, 1679, 1694, 1709, 1725, 1740, 1755, 1770, 1785, 1800, 1815, 1830, 1846, 1862, 1877, 1893, 1908, 1924, 1939, 1956, 1973, 1989, 2006, 2023, 2040, 2057, 2074, 2091, 2108, 2125, 2142, 2160, 2179, 2196, 2214], 18: [18, 37, 56, 76, 95, 115, 135, 154, 174, 193, 212, 232, 252, 272, 292, 312, 332, 352, 370, 389, 409, 428, 448, 469, 487, 506, 525, 544, 562, 582, 602, 623, 645, 666, 688, 710, 732, 754, 777, 797, 819, 841, 863, 885, 906, 927, 947, 967, 986, 1006, 1026, 1045, 1064, 1082, 1100, 1118, 1136, 1154, 1173, 1192, 1210, 1228, 1246, 1264, 1282, 1300, 1317, 1335, 1356, 1376, 1396, 1413, 1430], 19: [23, 42, 62, 82, 101, 121, 141, 160, 179, 198, 218, 238, 258, 278, 298, 318, 338, 356, 374, 394], 20: [38, 57, 77, 96, 116, 136, 155, 175, 194, 213, 233, 253, 273, 293, 313, 333, 353, 371, 390, 410, 429, 449, 470, 488, 507, 526, 543, 563, 583, 603, 624, 646, 667, 689, 711, 733, 755], 21: [58, 78], 22: [59, 79, 98, 118, 138, 157, 176, 195, 215, 235, 255, 275, 295, 315], 23: [97, 117, 137, 156], 24: [102, 122, 142, 161, 180, 199, 219, 239, 259, 279, 299, 319, 339, 358, 376, 396, 416, 435, 456, 476, 494, 514], 25: [104, 125, 145, 163, 183, 202], 26: [108, 128, 148, 167], 27: [168, 187, 206, 225, 246, 266, 285, 305, 326, 345, 364, 381, 401, 420, 439, 460, 479, 498, 518, 535, 554, 574, 595, 615, 636, 657, 679, 700, 723, 745, 767, 787, 810, 832, 854, 875, 896, 917, 939, 958, 978, 997, 1018, 1036, 1056, 1075, 1093, 1111, 1129, 1147, 1166, 1185, 1204, 1222, 1239, 1257, 1275, 1293, 1310, 1327, 1346, 1367, 1386, 1405, 1421, 1438, 1458, 1475, 1492, 1508, 1525, 1542, 1558, 1575, 1593, 1609, 1623, 1638, 1653, 1668, 1683, 1698, 1714, 1729, 1745, 1760, 1774, 1791, 1805, 1821, 1837, 1853, 1869, 1883], 28: [169, 188, 207, 227, 247, 267, 287, 307, 327, 347, 366, 384, 404, 423, 443, 464, 482, 500], 29: [186, 205], 30: [214, 234, 254, 274, 294, 314, 334], 31: [220, 241, 261, 281, 301, 321, 341, 361, 378, 398, 417, 436, 457], 32: [245, 265, 286, 306, 325, 346, 365, 382, 402, 421, 441, 462, 480, 499, 519, 536, 555, 575, 594, 616, 638, 659, 681, 701, 722, 747, 769, 788, 811, 833, 856, 877, 898, 919, 938, 959, 979, 999, 1019, 1038, 1057, 1076, 1094, 1112, 1130, 1148, 1167, 1186, 1205, 1223, 1240, 1259, 1277, 1295, 1312, 1329, 1349, 1370, 1390, 1408, 1424, 1442, 1460, 1478, 1495, 1512, 1529, 1546, 1562, 1579, 1596, 1612, 1627, 1642, 1657, 1672, 1687, 1702, 1718, 1733, 1748, 1764, 1779, 1793, 1809, 1823, 1840, 1856, 1871, 1887, 1902, 1918, 1933, 1950, 1967, 1983, 2000, 2017, 2034, 2051, 2069, 2085, 2102, 2119], 33: [330, 350], 34: [372, 391, 411, 430, 451], 35: [379, 400, 419, 438, 459, 478, 497, 517, 534, 552, 572, 592, 613, 634, 656, 677, 699, 721, 743, 765, 785, 808, 830, 852, 874, 895, 916, 936, 956, 976, 996, 1016, 1035, 1054, 1073, 1091, 1109, 1127, 1145, 1164, 1183, 1202, 1220, 1238, 1256, 1274, 1292, 1309, 1326, 1345, 1366, 1387, 1404, 1419, 1437, 1456, 1474, 1491, 1509, 1526, 1543, 1559, 1576, 1592, 1608, 1624, 1639, 1654, 1669, 1684, 1699, 1715, 1730, 1744, 1759, 1775, 1789, 1804, 1819, 1835, 1851, 1867, 1882, 1899, 1914, 1930, 1946, 1964, 1981, 1997, 2014, 2031, 2048, 2065, 2082, 2099, 2116, 2133, 2151, 2171, 2188, 2207], 36: [386, 406, 424, 445, 465, 484, 502, 522, 540, 559, 579, 599, 621, 642, 663, 685, 707, 729, 751, 773, 793, 815, 837], 37: [387, 407, 426, 446, 467, 485, 504], 38: [413, 432, 454, 473, 492, 512, 531, 548, 568, 588, 608, 629, 652, 672, 694, 717, 739, 761, 781, 803, 825, 847, 870, 892, 913, 932, 952, 973, 993, 1013, 1032, 1051, 1070, 1089, 1108, 1125, 1143, 1163, 1182, 1200, 1217], 39: [440, 461], 40: [471, 489], 41: [495, 515], 42: [508, 528], 43: [509, 527, 545, 565, 585, 605, 626, 648, 669, 691, 713, 735, 757], 44: [523, 541, 560, 580, 600, 620, 643, 664, 686, 708, 730, 752, 774, 794, 816, 838, 860, 881, 902, 924, 943, 964, 984, 1003, 1023, 1042, 1061, 1080, 1098, 1116, 1134, 1152, 1171, 1190, 1209, 1227, 1244, 1263, 1281, 1299, 1315, 1332, 1353, 1373, 1393, 1411, 1427, 1445, 1463, 1481, 1498, 1515, 1532, 1549, 1564, 1581, 1598, 1614, 1629, 1644, 1659, 1674, 1689, 1704, 1720, 1735, 1750, 1765, 1780, 1795, 1810, 1825, 1841, 1857, 1872, 1888, 1903, 1919, 1934, 1951, 1968, 1984, 2001, 2018, 2035, 2052, 2068, 2086, 2103, 2120, 2136, 2154, 2173, 2191, 2208], 45: [538, 557, 577, 597, 618, 640, 661, 683, 705, 727, 749, 771, 791, 813, 835, 858, 879, 900, 921, 941, 961, 981, 1001, 1021, 1040, 1059, 1078, 1096, 1114, 1132, 1150, 1169, 1188, 1207, 1225, 1243, 1261, 1279, 1297, 1314, 1331, 1351, 1372, 1392, 1410, 1426, 1444, 1461, 1479, 1496, 1513, 1530, 1547, 1563, 1580, 1597, 1613, 1628, 1643, 1658, 1673, 1688, 1703, 1719, 1734, 1749, 1763, 1778, 1794, 1808, 1824, 1839], 46: [549, 569, 589, 611, 632, 654, 676, 698, 719, 741, 763, 783, 806, 828, 850, 873, 894, 914, 934, 954, 974, 995, 1015, 1034, 1053, 1072, 1090, 1107, 1126, 1144, 1162, 1181, 1201, 1219, 1237, 1255, 1273, 1291, 1308, 1325, 1344, 1365, 1385, 1403, 1420, 1439, 1457, 1476, 1493, 1510, 1527, 1544, 1560, 1577, 1594, 1610, 1625, 1640, 1655, 1670, 1685, 1700, 1716, 1731, 1746, 1761, 1776, 1790, 1806, 1820, 1836, 1852, 1868, 1884, 1900, 1915, 1931, 1947, 1963, 1980, 1996, 2013, 2030, 2047, 2064, 2081, 2098, 2115, 2132, 2150, 2169, 2187, 2204], 47: [551, 571, 591, 612, 633, 655, 675, 697, 720, 742, 764, 784, 807, 829, 851, 872, 893, 915, 935, 955, 975, 994, 1014, 1033, 1052, 1071, 1088, 1106, 1124, 1142, 1161, 1180, 1199, 1218, 1236, 1254, 1272, 1290, 1307, 1324, 1343, 1364, 1383, 1402, 1417, 1435, 1454, 1472, 1490, 1507, 1524, 1541, 1557, 1573, 1591, 1606, 1622, 1637, 1652, 1666, 1681, 1696, 1712, 1727, 1742, 1757, 1772, 1787, 1802, 1817, 1833, 1849, 1865, 1880, 1896, 1910, 1927, 1943, 1960, 1977, 1993, 2010, 2027, 2044, 2061, 2078, 2095, 2112, 2129, 2147, 2166, 2183, 2201], 48: [564, 584, 604, 625, 647, 668, 690, 712, 734, 756], 49: [609, 631], 50: [674, 696, 718, 740, 762], 51: [692, 714, 736, 758, 779, 801, 823, 845, 867, 889, 909, 930, 949, 969, 989, 1009, 1028, 1047, 1066, 1084, 1102, 1120, 1138, 1158, 1176, 1196, 1215, 1233, 1251, 1269, 1287, 1304, 1321, 1340, 1360], 52: [703, 725], 53: [744, 766], 54: [775, 795, 817, 839, 861, 883, 904, 925, 945], 55: [798, 820, 842, 864], 56: [799, 821, 843, 865, 887, 910], 57: [800, 822, 844, 866, 888, 908], 58: [805, 827, 849, 869], 59: [882, 903, 923, 944, 963, 983, 1004, 1024, 1043, 1062], 60: [928, 948], 61: [929, 950], 62: [968, 988, 1008, 1027, 1046, 1065, 1083, 1101, 1119, 1137, 1157, 1175], 63: [991, 1011, 1030, 1049, 1068, 1086, 1104], 64: [1079, 1097], 65: [1122, 1140, 1159, 1179, 1197, 1214, 1232], 66: [1133, 1151, 1170, 1189, 1208, 1226, 1245, 1262, 1280, 1298, 1316], 67: [1194, 1213, 1231, 1249, 1267, 1285, 1303, 1320, 1339, 1359, 1379], 68: [1235, 1253, 1271, 1288, 1306, 1323, 1342, 1363, 1384, 1401, 1418, 1436, 1455, 1473, 1489, 1506, 1523, 1540, 1556, 1574, 1590, 1607, 1621, 1636, 1651, 1667, 1682, 1697, 1713, 1728, 1743, 1758, 1773, 1788, 1803, 1818, 1834, 1850, 1866, 1881, 1897, 1912, 1928, 1944, 1961, 1978, 1994, 2011, 2028, 2045, 2062, 2079, 2096, 2113, 2130, 2148, 2167, 2185, 2203], 69: [1250, 1268, 1286], 70: [1333, 1354, 1374, 1394], 71: [1334, 1355, 1375, 1395], 72: [1347, 1368, 1388, 1406, 1422, 1440], 73: [1361, 1381], 74: [1380, 1399], 75: [1428, 1446, 1464, 1482, 1499, 1516, 1533, 1550, 1566, 1583, 1600, 1616, 1631, 1646, 1661, 1676, 1691, 1706, 1722, 1737, 1752, 1767, 1782, 1797, 1812, 1826, 1842, 1859, 1874, 1890, 1905, 1921, 1936, 1954, 1971, 1987, 2004, 2021, 2038, 2055, 2072, 2089, 2106, 2123, 2140, 2158, 2177, 2194, 2212], 76: [1429, 1447, 1465, 1483, 1500, 1517, 1534, 1551, 1567, 1584, 1601, 1617, 1632, 1648, 1663, 1678, 1693, 1708, 1724, 1739, 1754, 1769, 1784, 1799, 1814, 1829, 1845, 1861, 1876, 1892, 1907, 1923, 1938, 1955, 1972, 1988, 2005, 2022, 2039, 2056, 2073, 2090, 2107, 2124, 2141, 2159, 2178, 2195, 2213], 77: [1434, 1453, 1471], 78: [1448, 1466, 1484, 1501, 1518, 1535, 1552, 1568, 1585, 1602, 1618, 1633, 1647, 1662, 1677, 1692, 1707, 1723, 1738, 1753, 1768, 1783, 1798, 1813, 1828, 1844, 1860, 1875, 1891, 1906, 1922, 1937, 1953, 1970, 1986, 2003, 2020, 2037, 2054, 2071, 2088, 2105, 2122, 2139, 2157, 2176, 2193, 2211], 79: [1450, 1468, 1485, 1502, 1519, 1536], 80: [1572, 1589, 1605], 81: [1831, 1847, 1863, 1878, 1894, 1909, 1925, 1940, 1957, 1974, 1991, 2008, 2025, 2042, 2059, 2076, 2093, 2110, 2127, 2145, 2164, 2182, 2200], 82: [1898, 1913, 1929, 1945, 1962], 83: [1949, 1966], 84: [1958, 1975, 1990, 2007, 2024, 2041, 2058, 2075, 2092, 2109, 2126, 2143, 2162, 2180, 2198], 85: [1995, 2012, 2029, 2046, 2063, 2080, 2097, 2114, 2131, 2149, 2168], 86: [1999, 2016, 2033, 2050, 2067, 2084, 2101, 2118, 2135, 2153, 2172], 87: [2137, 2155, 2174, 2190, 2209], 88: [2144, 2163], 89: [2181, 2199], 90: [2186, 2205]}

Ok, I looked at the result of traced and it seems to make sense to me; lets rewrite this as a single function that Jacques can add to his notebook. I will assume he has a pandas df as input.

8.3 Get information from a group of cells

As a final step, we should be able to extract and play with the information from one or more groups of cells.

cell_id = 6
cell_idx = input['ID'].isin(traced[cell_id])
cell_data = input[cell_idx].reset_index()

plt.scatter(cell_data['X'], cell_data['Y'])
final_row = cell_data.Frame.max()
for start_time in range(1, final_row - 1):
    ti_idx = cell_data.Frame == start_time
    tj_idx = cell_data.Frame == start_time + 1
    p1x = cell_data[ti_idx].X
    p2x = cell_data[tj_idx].X
    p1y = cell_data[ti_idx].Y
    p2y = cell_data[tj_idx].Y
    x_points = [p1x, p2x]
    y_points = [p1y, p2y]
    plt.plot(x_points, y_points)
plt.show()

seaborn.violinplot(data = cell_data.Area)
plt.show()

9 Single function definition

Now let us combine the things learned above into a single function.

def nearest_cells_over_time(df, max_dist = 10.0, x_column = 'X', y_column = 'Y'):
    """Trace cells over time"""
    gdf = geopandas.GeoDataFrame(
        df,
        geometry = geopandas.points_from_xy(df[x_column], df[y_column]))

    final_time = gdf.Frame.max()
    pairwise_distances = []
    for start_time in range(1, final_time):
        i = start_time
        j = i + 1
        ti_idx = gdf.Frame == i
        tj_idx = gdf.Frame == j
        print("Getting distances of dfs ", i, " and ", j, ".")
        ti = gdf[ti_idx]
        tj = gdf[tj_idx]
        ti_rows = ti.shape[0]
        tj_rows = tj.shape[0]
        titj = geopandas.sjoin_nearest(ti, tj, distance_col = "pairwise_dist")
        pairwise_distances.append(titj)

    max_dist = 10.0
    id_counter = 0
    ## Cell IDs pointing to a list of cells
    traced = {}
    ## Endpoints pointing to the cell IDs
    ends = {}
    for i in range(0, final_time - 1):
        query = pairwise_distances[i]
        passed_idx = query.pairwise_dist <= max_dist
        failed_idx = query.pairwise_dist > max_dist
        if (failed_idx.sum() > 0):
            print("Skipped ", failed_idx.sum(), " elements in segment ", i, ".")
        query = query[passed_idx]
        for row in query.itertuples():
            start_cell = row.ID_left
            end_cell = row.ID_right
            if start_cell in ends.keys():
                cell_id = ends[start_cell]
                current_value = traced[cell_id]
                current_value.append(end_cell)
                traced[cell_id] = current_value
                ends[end_cell] = cell_id
            else:
                id_counter = id_counter + 1
                traced[id_counter] = [start_cell, end_cell]
                ends[end_cell] = id_counter

    return traced
pander::pander(sessionInfo())

R version 4.2.0 (2022-04-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=en_US.UTF-8, LC_ADDRESS=en_US.UTF-8, LC_TELEPHONE=en_US.UTF-8, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=en_US.UTF-8

attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: spatstat.geom(v.3.0-6), spatstat.data(v.3.0-0), hpgltools(v.1.0), testthat(v.3.1.6), SummarizedExperiment(v.1.28.0), GenomicRanges(v.1.50.2), GenomeInfoDb(v.1.34.9), IRanges(v.2.32.0), S4Vectors(v.0.36.1), MatrixGenerics(v.1.10.0), matrixStats(v.0.63.0), Biobase(v.2.58.0) and BiocGenerics(v.0.44.0)

loaded via a namespace (and not attached): rappdirs(v.0.3.3), rtracklayer(v.1.58.0), tidyr(v.1.3.0), ggplot2(v.3.4.1), clusterGeneration(v.1.3.7), bit64(v.4.0.5), knitr(v.1.42), DelayedArray(v.0.24.0), data.table(v.1.14.6), KEGGREST(v.1.38.0), RCurl(v.1.98-1.10), doParallel(v.1.0.17), generics(v.0.1.3), GenomicFeatures(v.1.50.4), callr(v.3.7.3), RhpcBLASctl(v.0.23-42), cowplot(v.1.1.1), usethis(v.2.1.6), RSQLite(v.2.2.20), shadowtext(v.0.1.2), tzdb(v.0.3.0), bit(v.4.0.5), enrichplot(v.1.18.3), xml2(v.1.3.3), httpuv(v.1.6.8), assertthat(v.0.2.1), viridis(v.0.6.2), xfun(v.0.37), hms(v.1.1.2), jquerylib(v.0.1.4), evaluate(v.0.20), promises(v.1.2.0.1), fansi(v.1.0.4), restfulr(v.0.0.15), progress(v.1.2.2), caTools(v.1.18.2), dbplyr(v.2.3.0), igraph(v.1.4.0), DBI(v.1.1.3), htmlwidgets(v.1.6.1), purrr(v.1.0.1), ellipsis(v.0.3.2), dplyr(v.1.1.0), backports(v.1.4.1), annotate(v.1.76.0), aod(v.1.3.2), biomaRt(v.2.54.0), deldir(v.1.0-6), vctrs(v.0.5.2), here(v.1.0.1), remotes(v.2.4.2), cachem(v.1.0.6), withr(v.2.5.0), ggforce(v.0.4.1), HDO.db(v.0.99.1), vroom(v.1.6.1), GenomicAlignments(v.1.34.0), treeio(v.1.22.0), prettyunits(v.1.1.1), DOSE(v.3.24.2), ape(v.5.6-2), lazyeval(v.0.2.2), crayon(v.1.5.2), genefilter(v.1.80.3), edgeR(v.3.40.2), pkgconfig(v.2.0.3), tweenr(v.2.0.2), nlme(v.3.1-162), pkgload(v.1.3.2), devtools(v.2.4.5), rlang(v.1.0.6), lifecycle(v.1.0.3), miniUI(v.0.1.1.1), downloader(v.0.4), filelock(v.1.0.2), BiocFileCache(v.2.6.0), rprojroot(v.2.0.3), polyclip(v.1.10-4), graph(v.1.76.0), Matrix(v.1.5-3), aplot(v.0.1.9), boot(v.1.3-28.1), processx(v.3.8.0), png(v.0.1-8), viridisLite(v.0.4.1), rjson(v.0.2.21), bitops(v.1.0-7), gson(v.0.0.9), KernSmooth(v.2.23-20), pander(v.0.6.5), Biostrings(v.2.66.0), blob(v.1.2.3), stringr(v.1.5.0), qvalue(v.2.30.0), readr(v.2.1.4), remaCor(v.0.0.11), gridGraphics(v.0.5-1), scales(v.1.2.1), memoise(v.2.0.1), GSEABase(v.1.60.0), magrittr(v.2.0.3), plyr(v.1.8.8), gplots(v.3.1.3), zlibbioc(v.1.44.0), compiler(v.4.2.0), scatterpie(v.0.1.8), BiocIO(v.1.8.0), RColorBrewer(v.1.1-3), lme4(v.1.1-31), Rsamtools(v.2.14.0), cli(v.3.6.0), XVector(v.0.38.0), urlchecker(v.1.0.1), patchwork(v.1.1.2), ps(v.1.7.2), MASS(v.7.3-58.2), mgcv(v.1.8-41), tidyselect(v.1.2.0), stringi(v.1.7.12), highr(v.0.10), yaml(v.2.3.7), GOSemSim(v.2.24.0), locfit(v.1.5-9.7), ggrepel(v.0.9.3), grid(v.4.2.0), sass(v.0.4.5), fastmatch(v.1.1-3), tools(v.4.2.0), parallel(v.4.2.0), rstudioapi(v.0.14), foreach(v.1.5.2), gridExtra(v.2.3), farver(v.2.1.1), ggraph(v.2.1.0), digest(v.0.6.31), shiny(v.1.7.4), Rcpp(v.1.0.10), broom(v.1.0.3), later(v.1.3.0), httr(v.1.4.4), AnnotationDbi(v.1.60.0), Rdpack(v.2.4), colorspace(v.2.1-0), brio(v.1.1.3), XML(v.3.99-0.13), fs(v.1.6.1), reticulate(v.1.28), splines(v.4.2.0), yulab.utils(v.0.0.6), PROPER(v.1.30.0), tidytree(v.0.4.2), spatstat.utils(v.3.0-1), graphlayouts(v.0.8.4), ggplotify(v.0.1.0), plotly(v.4.10.1), sessioninfo(v.1.2.2), xtable(v.1.8-4), jsonlite(v.1.8.4), nloptr(v.2.0.3), ggtree(v.3.6.2), tidygraph(v.1.2.3), ggfun(v.0.0.9), R6(v.2.5.1), RUnit(v.0.4.32), profvis(v.0.3.7), pillar(v.1.8.1), htmltools(v.0.5.4), mime(v.0.12), glue(v.1.6.2), fastmap(v.1.1.0), minqa(v.1.2.5), clusterProfiler(v.4.6.0), BiocParallel(v.1.32.5), codetools(v.0.2-19), fgsea(v.1.24.0), pkgbuild(v.1.4.0), mvtnorm(v.1.1-3), utf8(v.1.2.3), lattice(v.0.20-45), bslib(v.0.4.2), tibble(v.3.1.8), sva(v.3.46.0), pbkrtest(v.0.5.2), curl(v.5.0.0), gtools(v.3.9.4), zip(v.2.2.2), openxlsx(v.4.2.5.2), GO.db(v.3.16.0), survival(v.3.5-3), limma(v.3.54.1), rmarkdown(v.2.20), desc(v.1.4.2), munsell(v.0.5.0), GenomeInfoDbData(v.1.2.9), iterators(v.1.0.14), variancePartition(v.1.28.4), reshape2(v.1.4.4), gtable(v.0.3.1) and rbibutils(v.2.2.13)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset c100d5666f5032d24c933739015d267ef651c323
## This is hpgltools commit: Wed Mar 1 09:50:14 2023 -0500: c100d5666f5032d24c933739015d267ef651c323
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to index-v20230315.rda.xz
tmp <- sm(saveme(filename=this_save))
---
title: "Notes for Jacques images."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library("hpgltools")
## library("reticulate")
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(width=120,
                     progress=TRUE,
                     verbose=TRUE,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      dpi=96)
lua_filters <- rmarkdown::pandoc_lua_filter_args("pandoc-zotxt.lua")
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
rundate <- format(Sys.Date(), format="%Y%m%d")
previous_file <- ""
ver <- format(Sys.Date(), "%Y%m%d")

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "index.Rmd"
library(spatstat.geom)
```

# Settings which may be necessary in emacs

I have been messing with my emacs setup to make this project easier.

```{elisp setup_workon, eval=FALSE}
(defvar elpy-rpc-virtualenv-path 'current)
(pyvenv-activate "/sw/local/fiji/202211")
(pyvenv-activate "venv")
```

# Introduction

Jacques is seeking to segment and follow cells over time.

Steps performed so far:

1.  Python/venv installation of fiji
2.  Figured out some easy interactions between python and actual image data.
3.  Implemented a simple function to find minimum distances between
    putative cells using the extant methods.

# Next steps

1.  Note that the czi images are immediately openable via fiji's bioformat interface.
2.  Load dataset via fiji
3.  Invoke cellpose via the python interface (same as pyimagej)
4.  Save ROIs produced by cellpose, save them to zip output file
  a.  The roi interface in fiji can address these
5.  Currently a macro is invoked which performs a set of measurements
    on every ROI and saves them to a csv.  This creates the primary
    data structure used for all following processes.
6.  Need to create a datastructure which identifies each individual
    cell over time given these ROIs which x/y/time(frame) along with
    the measurements of interest (area/mean/stdev/intensities by
    color).
    a.  Small postprocessing details: the intensity values produced
    must be normalized by cell area.

# Demonstrate the first couple of steps using an actual dataset

Jacques sent me an image acquired from the microscope and I saved it
as 'test_data/raw.tif', he also sent me a cellpose model which I saved
to the 'models/' directory.

Given these as a starting point, let us try to open the image with a
fiji instance and split it into a series of smaller files by
timepoint.

## Load necessary python modules

```{python load}
import os
import imagej
import pandas
import numpy as np
import geopandas
import matplotlib.pyplot as plt
import seaborn
import scyjava
import multiprocessing as mp
```

## Set up some initial variables

```{python parameters}
base_dir = '/home/trey/scratch/jacques'
os.chdir(base_dir)
input_file = f"{base_dir}/test_data/raw.tif"
output_dir = f"{base_dir}/outputs"
```

## Start imagej/fiji

Note that I am using fiji/imagej from within a virtual environment
which I symbolically linked to the local directory 'venv/'.

```{python start_fiji}
scyjava.config.add_option('-Xmx64g')
ij = imagej.init('venv/bin/Fiji.app', mode = 'interactive')
ij.getVersion()
os.chdir(base_dir)
```

## Open the input file and split it by time

Open the file, figure out its dimensions, and write portions of it to
an output directory.

I wrote this thinking I could parallelize the output writing to 8
cpus.  But I think I do not yet understand how python scopes variables
in this context and so it did not quite work.  I turned that off for
the moment and ran it and it finished in about a minute.

```{python split_image, eval=FALSE}
def separate_slices(input_file, output_base = 'outputs', wanted_x = True, wanted_y = True,
                    wanted_z = 1, wanted_channel = 2, cpus = 8):
    pool = mp.Pool(cpus)
    input_base = os.path.basename(input_file)
    input_dir = os.path.dirname(input_file)
    input_name = os.path.splitext(input_base)[0]
    output_directory = f"{input_dir}/outputs/{input_name}_substack"
    os.makedirs(output_directory, exist_ok = True)
    print("Starting to open the input file, this takes a moment.")
    dataset = ij.io().open(input_file)
    print(f"Opened input file, writing images to {output_directory}")

    data_info = {}
    for element in range(len(dataset.dims)):
        name = dataset.dims[element]
        data_info[name] = dataset.shape[element]
    print(f"This dataset has dimensions: X:{data_info['X']} Y:{data_info['Y']} Z:{data_info['Z']} Time:{data_info['Time']}")

    def save_slice(timepoint):
        wanted_slice = dataset[:, :, wanted_channel, wanted_z, timepoint]
        slice_image = ij.py.to_dataset(wanted_slice)
        output_filename = f"{output_directory}/{input_name}_{timepoint}.tif"
        saved = ij.io().save(slice_image, output_filename)
        message = f"Saving image {input_name}_{timepoint}."
        print(message)

    for timepoint in range(data_info['Time']):
        saved = save_slice(timepoint)
        ## I want to see if I can split the saves across 8 cpus.
        ## pool.apply(save_slice, args = (timepoint, ))
    ## pool.close()
    return dataset

raw_split = separate_slices(input_file)
```

## Create

# Make an initial implementation of following

1.  Split the table into a list of dataframes which include a column
    containing the minimum distance and ID of the mindist element in
    the next/previous element of the list.

```{r follow_cells}
start <- openxlsx::readWorkbook("distance_calculation/exp0329scene5.xlsx")
readr::write_tsv(x=start, file="distance_calculation/testing.tsv")
```

## Absolute value distances among cells

Lets make a quick implementation of an absolute distance of each cell
over time in the hopes that I can better understand the question.

```{r spatstat}
positions <- start[, c("X", "Y")]
colnames(positions) <- c("x", "y")
window <- ripras(positions)
cell_positions <- as.ppp(positions, W = window)
marks(cell_positions) <- start
unitname(cell_positions) <- "micrometer"

## Get every pairwise distance between two cells.
all_distances <- nndist(cell_positions, by = "ID")

## Get the cell ID which in the nearest neighbor in every frame for every cell.
closest_by_time <- nnwhich(cell_positions, by = "Frame")
## Note that the first column is a special case
closest_by_time[, 1] <- 1:nrow(closest_by_time)
## Track the first cell over time
cell1_closest <- closest_by_time[1, ]
cell1_closest

cell2_closest <- closest_by_time[2, ]
cell2_closest

cell5_closest <- closest_by_time[5, ]
cell5_closest


## Each number of this vector is the ID of the cell over time.
cell1_areas <- marks(cell_positions)[cell1_closest, "Area"]

cell1_positions <- cell_positions[cell1_closest, ]
```

## Now iterate over each pair of time points 1:2, 2:3, 3:4, ... 120:121

Now lets add a little logic to extract just the cells for timepoints a,b

```{r iterate}
times <- unique(marks(cell_positions)[["Frame"]])
for (t in 2:max(times)) {
  first_idx <- marks(cell_positions)[["Frame"]] == t - 1
  second_idx <- marks(cell_positions)[["Frame"]] == t
  both_idx <- first_idx | second_idx
  subset <- cell_positions[both_idx, ]
  closest_pair <- nnwhich(subset, by = "Frame")
}
```

Ok, so I think I have a little bit of the idea now, let us try and
implement something closer to what Jacques actually wants in python.

# Loading necessary python libraries

I am writing this document in R markdown and using the ipython REPL to
run the code because my confidence in my python fluency is not very
high (I still kind of think in python2 sometimes because that is what
I learned first) and I make stupid mistakes pretty regularly; so
getting that immediate feedback that my datastructures are wrong is
nice.

Unfortunately, the interaction between how I use rmarkdown and python
and ipython is not perfect.  Thus in the following python blocks you
may see a stupid 1+1 statement at the beginning, this is a very weird
work around to keep the statements I am sending from going haywire and
sending my text notes to python.

# Load the TSV data

In the above R implementation, I dumped the data to a tsv file, and
will just read that here for simplicity.

While I am testing, I will extract only the columns containing the ID,
x, y, and time(Frame).

```{python load_tsv}
file_location = os.path.join('distance_calculation', 'testing.tsv')
input = pandas.read_table(file_location)
input.columns.values[0] = "ID"
coords = input[["ID", "X", "Y", "Frame", "Area"]]
```

# Load the coordinates into a geospatial distance method

geopandas looks like it handles coordinate data in a similar fashion
to R's spatstat.geom. I am not sure if it will be able to
handle a z-stack; but I assume it will prove easy to extend the
implementation to points_from_xyz().

```{python geospatial}
gdf = geopandas.GeoDataFrame(
    coords,
    geometry = geopandas.points_from_xy(coords.X, coords.Y))
```

I think the easiest way to extract minimum distances in python is to
split this df into pairs by Frame followed by using the
sjoin_nearest() function to find nearest points from a->b.

## subset the gdf data

 In the following block, I will iterate over every time from 1 to
 end-1 and extract the elements of the gdf from those two time points.
 Then I will use the sjoin_nearest() to get me every time-pairwise
 closest distance as a new dataframe.

 These pairwise distance dataframes will be appended to the eponymous
 pairwise_distances array; which upon completion will comprise n-1
 elements where the first element includes the results for timepoints
 1,2 and the last element includes n-1,n.

```{python gdf_subset}
tt = 1 + 2

final_time = gdf.Frame.max()
pairwise_distances = []
for start_time in range(1, final_time):
    i = start_time
    j = i + 1
    ti_idx = gdf.Frame == i
    tj_idx = gdf.Frame == j
    print("Getting distances of dfs ", i, " and ", j, ".")
    ti = gdf[ti_idx]
    tj = gdf[tj_idx]
    ti_rows = ti.shape[0]
    tj_rows = tj.shape[0]
    titj = geopandas.sjoin_nearest(ti, tj, distance_col = "pairwise_dist")
    pairwise_distances.append(titj)
```

## Filter the pairwise distances and find neighbors over time

Given the pairwise_distances array, I should be able to iterate over
it and make a series of arrays containing the cell IDs which are
composed of cells which are closest to each other over time.

The primary caveat here is that we know that sometimes the cells are
merged or other shenanigans happen which cause the closest distance to
get relatively large, thus I will filter out any pairwise distances
which are greater than max_dist.

As a result, the number of cells observed over time will by definition
not be the ~ 20 observed in the 121 timepoints, but will be some
larger number comprised of the sum over time.  A likely next step will
be to merge these partial groups post-facto, but not right now, I just
want to get this initial implementation working.

Here is the logic:

1.  Use id_counter to make a global cell ID across all timepoints.
2.  Use traced{} to associate the id_counter cell ID to an array of
    observed cells over time.
3.  Use ends{} to associate every known current endpoint with a
    id_counter cell ID so that we can keep track of who is who.
4.  Then iterate over every element of pairwise_distances and do the following:
  a.  drop elements with distances > max_dist.
  b.  check to see if the early-timepoint nearest neighbor
      (start_cell) has been seen in ends{}:
    * If so add end_cell to it and to the appropriate element of traced{}.
    * If not, initialize traced{} with the start/end and add it to ends{}.

Upon completion I should have a dictionary of arrays containing all of
the contiugous nearest-neighbor cells that stayed within max_dist of
each other.

```{python extract_closest}
## Start over if the distance between two cells is bigger than 10 units.
max_dist = 10.0
## Start over if the proportion between the bigger and smaller cell area has changed
## by more than a factor of 0.25.
max_prop = 0.7
id_counter = 0
## Cell IDs pointing to a list of cells
traced = {}
## Endpoints pointing to the cell IDs
ends = {}
for i in range(0, final_time - 1):
    query = pairwise_distances[i]

    passed_idx = query.pairwise_dist <= max_dist
    failed_idx = query.pairwise_dist > max_dist
    if (failed_idx.sum() > 0):
        print("Skipped ", failed_idx.sum(), " elements in segment ", i,
              " because distance is too large.")
        query = query[passed_idx]

    prop_change = query.Area_left / query.Area_right
    increased_idx = prop_change > 1.0
    prop_change[increased_idx] = 1.0 / prop_change[increased_idx]
    failed_idx = prop_change < max_prop
    passed_idx = prop_change >= max_prop
    if (failed_idx.sum() > 0):
        print("Skipped ", failed_idx.sum(), " elements in segment ", i,
              " because the size changed too much.")
        query = query[passed_idx]

    for row in query.itertuples():
        start_cell = row.ID_left
        end_cell = row.ID_right
        if start_cell in ends.keys():
            cell_id = ends[start_cell]
            current_value = traced[cell_id]
            current_value.append(end_cell)
            traced[cell_id] = current_value
            ends[end_cell] = cell_id
        else:
            id_counter = id_counter + 1
            traced[id_counter] = [start_cell, end_cell]
            ends[end_cell] = id_counter

traced
```

Ok, I looked at the result of traced and it seems to make sense to me;
lets rewrite this as a single function that Jacques can add to his
notebook.  I will assume he has a pandas df as input.

## Get information from a group of cells

As a final step, we should be able to extract and play with the
information from one or more groups of cells.

```{python get_info}
cell_id = 6
cell_idx = input['ID'].isin(traced[cell_id])
cell_data = input[cell_idx].reset_index()

plt.scatter(cell_data['X'], cell_data['Y'])
final_row = cell_data.Frame.max()
for start_time in range(1, final_row - 1):
    ti_idx = cell_data.Frame == start_time
    tj_idx = cell_data.Frame == start_time + 1
    p1x = cell_data[ti_idx].X
    p2x = cell_data[tj_idx].X
    p1y = cell_data[ti_idx].Y
    p2y = cell_data[tj_idx].Y
    x_points = [p1x, p2x]
    y_points = [p1y, p2y]
    plt.plot(x_points, y_points)
plt.show()

seaborn.violinplot(data = cell_data.Area)
plt.show()
```

# Single function definition

Now let us combine the things learned above into a single function.

```{python nearest_neighbor_function}
def nearest_cells_over_time(df, max_dist = 10.0, x_column = 'X', y_column = 'Y'):
    """Trace cells over time"""
    gdf = geopandas.GeoDataFrame(
        df,
        geometry = geopandas.points_from_xy(df[x_column], df[y_column]))

    final_time = gdf.Frame.max()
    pairwise_distances = []
    for start_time in range(1, final_time):
        i = start_time
        j = i + 1
        ti_idx = gdf.Frame == i
        tj_idx = gdf.Frame == j
        print("Getting distances of dfs ", i, " and ", j, ".")
        ti = gdf[ti_idx]
        tj = gdf[tj_idx]
        ti_rows = ti.shape[0]
        tj_rows = tj.shape[0]
        titj = geopandas.sjoin_nearest(ti, tj, distance_col = "pairwise_dist")
        pairwise_distances.append(titj)

    max_dist = 10.0
    id_counter = 0
    ## Cell IDs pointing to a list of cells
    traced = {}
    ## Endpoints pointing to the cell IDs
    ends = {}
    for i in range(0, final_time - 1):
        query = pairwise_distances[i]
        passed_idx = query.pairwise_dist <= max_dist
        failed_idx = query.pairwise_dist > max_dist
        if (failed_idx.sum() > 0):
            print("Skipped ", failed_idx.sum(), " elements in segment ", i, ".")
        query = query[passed_idx]
        for row in query.itertuples():
            start_cell = row.ID_left
            end_cell = row.ID_right
            if start_cell in ends.keys():
                cell_id = ends[start_cell]
                current_value = traced[cell_id]
                current_value.append(end_cell)
                traced[cell_id] = current_value
                ends[end_cell] = cell_id
            else:
                id_counter = id_counter + 1
                traced[id_counter] = [start_cell, end_cell]
                ends[end_cell] = id_counter

    return traced
```

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```
