Technical matters

Working files

Segway must create a number of working files in order to accomplish its tasks, and it does this in the directory specified by the required workdir argument. When running training, workdir is TRAINDIR; when running identification, workdir is IDENTIFYDIR.

The observation files can be quite large: they take up 8 bytes per track per position and cannot be compressed. As a result, they are written out to a temporary directory on an as-needed basis. Otherwise, they could take up terabytes when identifying on the whole human genome with dozens of tracks.
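As a rough illustration of the arithmetic, the following sketch estimates the uncompressed size of the observation data from the 8 bytes per track per position figure. The genome length (about 3.1 billion positions) and the track count (40) are assumptions chosen for the example, not values taken from Segway.

```python
BYTES_PER_TRACK_PER_POSITION = 8  # figure quoted in the text above


def observation_bytes(num_tracks, num_positions):
    """Estimate the uncompressed size of observation data in bytes."""
    return BYTES_PER_TRACK_PER_POSITION * num_tracks * num_positions


# Assumed inputs for illustration only: roughly 3.1 billion positions
# for the human genome and 40 tracks.
total = observation_bytes(num_tracks=40, num_positions=3_100_000_000)
print(f"{total / 1e12:.2f} TB")
```

With these assumed inputs the estimate is close to one terabyte, and it grows linearly with the number of tracks.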

You will find a full description of all the working files in the Workdir files section.

Temporary files

The identify and posterior tasks create temporary observation files in directories indicated by the Python tempfile.gettempdir() function, which searches for an appropriate directory as described in the documentation for tempfile.tempdir <https://docs.python.org/2/library/tempfile.html#tempfile.tempdir>. If you need to specify that temporary files go into a particular directory, set the TMPDIR environment variable. It is highly recommended that you ensure that your temporary directory does not reside on a slow storage medium such as an NFS filesystem. Because many temporary files are created and deleted, slow storage can significantly impact performance.
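To check which directory Python will pick for a given TMPDIR value, a small sketch like the following can help. The helper name resolve_tempdir is hypothetical, and the scratch directory is only a stand-in for a fast local disk. In normal use you would simply export TMPDIR in your shell before starting Segway.

```python
import os
import tempfile


def resolve_tempdir(tmpdir):
    """Return what tempfile.gettempdir() reports when TMPDIR is set to tmpdir."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = tmpdir
        # gettempdir() caches its answer in tempfile.tempdir, so clear the
        # cache to force the environment variable to be consulted again.
        tempfile.tempdir = None
        return tempfile.gettempdir()
    finally:
        # Restore the previous environment and cache.
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache


# Stand-in for a fast local scratch directory (assumption for the example).
scratch = tempfile.mkdtemp()
try:
    print(resolve_tempdir(scratch))
finally:
    os.rmdir(scratch)
```

If the directory named by TMPDIR does not exist or is not writable, Python falls back to other candidate directories, so it is worth verifying the result before a long run.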

Distributed computing

Segway can currently perform the training, identification, and posterior tasks only using a cluster controllable with the DRMAA interface. We have only tested it against Sun Grid Engine and Platform LSF, but it should be possible to work with other DRMAA-compatible distributed computing systems, such as PBS Pro, PBS/TORQUE, Condor, or GridWay. If you are interested in using one of these systems, please open an issue on GitHub <https://github.com/hoffmangroup/segway/issues/new> so that the fine details can be worked out. A standalone option exists when you set the SEGWAY_CLUSTER environment variable to local. Try installing the free Open Grid Scheduler on your workstation if you want to run Segway without a full clustering system.

The --cluster-opt option allows the specification of native options to your clustering system — those options you might pass to qsub (SGE) or bsub (LSF).

Memory usage

Inference on complex models or long sequences can be memory-intensive. In order to work efficiently when it is not always easy to predict memory use in advance, Segway controls the memory use of its subtasks on a cluster with a trial-and-error approach. It will submit jobs to your clustering system specifying the amount of memory they are expected to take up. Your clustering system should allocate these jobs such that the amount of memory on one host is not overcommitted. If a job takes up more memory than allocated, then it will be killed and restarted with a larger amount of memory allocated, along the progression specified in gibibytes by --mem-usage=progression. The default progression is 2,3,4,6,8,10,12,14,15.
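The trial-and-error approach can be pictured with the sketch below. It is not Segway's implementation: the function names are hypothetical, the required memory is supplied by hand instead of being discovered by a killed job, and raising an error once the progression is exhausted is an assumption.

```python
DEFAULT_PROGRESSION = "2,3,4,6,8,10,12,14,15"  # default quoted above, in GiB


def parse_progression(text):
    """Turn a comma-separated progression string into a list of numbers."""
    return [float(item) for item in text.split(",")]


def allocations_tried(progression, required_gib):
    """List the allocations attempted until one is large enough.

    A real job would be killed and resubmitted by the cluster; here the
    requirement is simply compared with each step of the progression.
    """
    attempts = []
    for allocation in progression:
        attempts.append(allocation)
        if required_gib <= allocation:
            return attempts
    # Assumption: what happens after the last step is not described above.
    raise MemoryError(f"job needs more than {progression[-1]} GiB")


print(allocations_tried(parse_progression(DEFAULT_PROGRESSION), 5))
```

A job that needs 5 GiB would be tried at 2, 3, and 4 GiB before succeeding at 6 GiB, which is why a progression that starts closer to your typical usage can save resubmissions.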

We usually train on regions of no more than 2,000,000 frames, where a single frame contains the number of nucleotides set by the --resolution option. If you use more than that many, GMTK might run out of dynamic range. This manifests itself as a “zero clique error.” Identify mode rescales probabilities at every frame so that this is not a problem. However, you will probably want to split the input sequences somewhat because larger sequences make more difficult work units (greater memory and run time costs) and thereby impede efficient parallelization. The --split-sequences=size option will split up sequences into windows with size base pairs each. The default size is 2,000,000. Decreasing to 500,000 will greatly improve speed at the cost of more artefacts at split boundaries.
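The effect of the window size on the number of work units can be sketched as follows. This is a naive fixed-size split written for illustration; Segway's actual splitting rule may differ (for example, in how it sizes the last window), and both function names are hypothetical.

```python
import math


def split_region(start, end, size=2_000_000):
    """Split [start, end) into consecutive windows of at most size base pairs."""
    return [(pos, min(pos + size, end)) for pos in range(start, end, size)]


def num_frames(start, end, resolution=1):
    """Number of frames in a window when each frame covers resolution base pairs."""
    return math.ceil((end - start) / resolution)


# A 5 Mbp region at the default size and at 500,000 bp.
print(len(split_region(0, 5_000_000)))
print(len(split_region(0, 5_000_000, 500_000)))
print(num_frames(0, 2_000_000, resolution=100))
```

Smaller windows yield more, shorter work units that parallelize better, at the price of more split boundaries where artefacts can appear.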

Reporting

Segway produces a number of logs of its activity during tasks, which can be useful for analyzing its performance or troubleshooting. These are all in the workdir/log directory.

Shell scripts

Segway produces three shell scripts in the log directory that you can use to replay its subtasks at different levels of abstraction. The top-level segway.sh records the command line used to run Segway. The run.sh script gives you the GMTK commands called by Segway. A small number of these are still produced when --dry-run is specified. The details.sh script contains the exact commands dispatched by Segway, including wrapper commands that monitor memory usage, create and delete local temporary files with observation data, and convert GMTK’s output to BED, among other things.

Segway also writes a cmdline directory in both the traindir and identifydir. Each instance has its own folder, and for each job Segway queues, a shell script (with the job’s name) is written containing the GMTK command of the queued job.

For example, traindir/cmdline/0/emt0.0.0.uuid.sh is the shell script containing the GMTK commands and arguments for job emt0.0.0.uuid in instance 0 of training.

Summary reports

The jobs.*.tab file is a tab-delimited file with one row for each job Segway dispatched for this instance, reporting the job identifier (jobid), job name (jobname), GMTK program (prog), number of segment labels (num_segs), number of frames (num_frames), maximum memory usage (maxvmem), CPU time (cpu), and exit/error status (exit_status). Jobs are written as they are completed. The exit status is useful for determining whether the job succeeded (status 0) or failed (any other value, which is sometimes numeric and sometimes text, depending on the clustering system used).
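One way to pull failed jobs out of such a file is sketched below. It assumes the file starts with a header row using the column names listed above; the sample rows, including the job identifiers, memory figures, and program name, are made up for the example.

```python
import csv
import io

# Made-up sample in the assumed layout: a header row, then one row per job.
SAMPLE = (
    "jobid\tjobname\tprog\tnum_segs\tnum_frames\tmaxvmem\tcpu\texit_status\n"
    "101\temt0.0.0.traindir.uuid\tgmtkEMtrain\t4\t20000\t1500000000\t12.5\t0\n"
    "102\temt0.0.1.traindir.uuid\tgmtkEMtrain\t4\t20000\t2100000000\t3.1\t137\n"
)


def failed_jobs(handle):
    """Return the names of jobs whose exit_status is anything other than 0.

    The status is compared as text because it may be numeric or textual,
    depending on the clustering system.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["jobname"] for row in reader if row["exit_status"] != "0"]


print(failed_jobs(io.StringIO(SAMPLE)))
```

For a real run, pass an open file handle for the jobs file of the instance you are interested in, instead of the in-memory sample.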

The likelihood.*.tab files each track the progression of likelihood during a single instance of EM training. Each file has a single column containing the log likelihood, with one row for each round of training. More positive values are better.
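To see how quickly training is converging, you can look at the change in log likelihood between consecutive rounds. The sketch below assumes one value per line with no header row, and the sample values are invented for the example.

```python
def read_likelihoods(lines):
    """Parse one log likelihood per line, skipping blank lines."""
    return [float(line) for line in lines if line.strip()]


def improvements(values):
    """Change in log likelihood from each round to the next."""
    return [after - before for before, after in zip(values, values[1:])]


# Invented values for three rounds of training.
sample_lines = ["-1500.0\n", "-1200.0\n", "-1150.0\n"]
print(improvements(read_likelihoods(sample_lines)))
```

Shrinking positive differences suggest that the instance is approaching convergence.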

GMTK reports

The jt_info.txt and jt_info.posterior.txt files describe how GMTK builds a junction tree. They are of interest primarily during GMTK troubleshooting. You are unlikely to use them.

Task output

The output directory contains the output of the actual GMTK commands run by Segway. The o directory contains standard output and the e directory contains standard error. If a job fails and repeats, the output from the new job is appended to the old. The --verbosity=verbosity option controls how much diagnostic information GMTK writes into these files. The default and minimum value is 0. Raise this value for more information, and see the GMTK documentation for a description of the various levels of verbosity. Setting --verbosity=30 can be particularly helpful in diagnosing model problems. Keep in mind that very high values (above 60) will produce enormous amounts of output, possibly terabytes.

Warning

Running Segway in identify mode with non-zero verbosity is currently not supported and may result in errors.

Performance

Some factors that affect compute time and memory requirements:

  • the length of the longest region you are training or identifying on

  • the number of tracks

  • the number of labels

The longest region forms a bottleneck during training because Segway cannot start the next round of training before all regions in the previous round are done. So if you specify three regions, one of which is 10 Mbp long while the others are 100 kbp, the 10 Mbp region is going to be the limiting factor. You can use --split-sequences (see above) to put an upper bound on region size.

Names used by Segway

Workdir files

Segway expects to be able to create many of these files anew. To avoid data loss, by default, it will quit if they already exist. If you use the --clobber option, Segway will overwrite the whole workdir instead.

Filename                       Description
accumulators/                  intermediate files used to pass E-step results to the M-step of EM training
acc.*.bin                      accumulator for a particular instance and region (reused each round)
auxiliary/                     miscellaneous model files
dont_train.list                defines list of hidden random variables that are not trained
segway.inc                     C preprocessor (cpp) include file used in structure
cmdline                        shell scripts containing command lines/arguments of individual GMTK jobs
cmdline/0,1,.../               GMTK job shell scripts for a particular training instance (0, 1, …)
cmdline/identify/              GMTK job shell scripts for identification
intermediate                   files containing best training filenames per instance
train_result.*.tab             per-instance information containing the filenames of the params resulting in the best likelihood
likelihood/                    GMTK’s report of the log likelihood for the most recent M-step of EM training
likelihood.*.ll                contains text of the last log likelihood value for an instance; Segway uses this to decide when to stop training
validation.output.*.ll         contains text of the last validation GMTK output for an instance
validation.output.winner.*.ll  contains text of the current best validation GMTK output for an instance
validation.sum.*.ll            contains text of the last validation set log likelihood for an instance
validation.sum.winner.*.ll     contains text of the current best validation set log likelihood for an instance
log/                           diagnostic information
details.sh                     script file that includes the exact command lines queued by Segway, with wrapper scripts
jobs.*.tab                     tab-delimited summary of jobs queued per instance, including resource information and exit status
jt_info.txt                    log file used by GMTK when creating a junction tree
jt_info.posterior.txt          log file used by GMTK when creating a junction tree in posterior mode
likelihood.*.tab               tab-delimited summary of log likelihood by training instance; can be used to examine how fast training converges
validation.sum.*.tab           tab-delimited summary of full validation set log likelihood by training instance
validation.output.*.tab        tab-delimited summary of validation log likelihood for each window in the validation set by training instance
run.sh                         list of commands run by Segway, not including wrappers that create and clean up temporary files such as observations used during identification
segway.sh                      reports the command line used to run Segway itself
→ *.*.float32                  continuous data for a particular region
→ *.*.int                      indicator data (present/absent) for a particular region
float32.list                   list of continuous data files
int.list                       list of indicator data files
output/                        diagnostic output of individual GMTK jobs
output/e/                      stderr
output/e/0,1,                  stderr for a particular training instance (0, 1, …)
output/e/identify              stderr for identification
output/o/                      stdout
params/                        generated and trained parameters for a given instance
input.*.master                 generated hyperparameters and starting parameters
input.master                   best set of hyperparameters and starting parameters
params.*.params.*              trained parameters for a given instance and round
params.*.params                final trained parameters for a given instance
params.params                  best final set of trained parameters
segway.bed.gz                  segmentation in BED format
segway.str                     dynamic Bayesian network structure
train.tab                      important file locations and hyperparameters used in training, to be passed to identify
triangulation/                 triangulation files used for DBN interface
segway.str.*.*.trifile         triangulation file
viterbi/                       intermediate BED files created during distributed Viterbi decoding, which get merged into segway.bed.gz
window.bed                     a BED file containing chromosome regions and the indices Segway assigns to them

Job names

In order to watch Segway’s progress on your cluster, it is helpful to understand how it names jobs. A job name for the training task might look like this:

emt0.1.34.traindir.ed03201cea2047399d4cbcc4b62f9827

In this example, emt means expectation maximization training, the 0 means instance 0, the 1 means round 1, and the 34 means window 34. The name of the training directory is traindir, and ed03201cea2047399d4cbcc4b62f9827 is a universally unique identifier for this particular Segway run. This can be useful if you want to manage all of your jobs on your clustering system with wildcard specification. On SGE you can delete all the jobs from this run with:

qdel "*.ed03201cea2047399d4cbcc4b62f9827"

On LSF, use:

bkill -J "*.ed03201cea2047399d4cbcc4b62f9827"

Jobs created in the identify (vit) or posterior (jt) task are named similarly:

vit34.identifydir.4f32630d53724f08b34a8fc58793307d
jt34.identifydir.4f32630d53724f08b34a8fc58793307d

Of course, there are no instances or rounds for the identify task, so only the sequence index is reported.
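If you want to group or filter jobs programmatically, the naming scheme can be parsed with regular expressions. The patterns below are inferred only from the example names shown above, so treat them as a sketch; the function name parse_job_name is hypothetical.

```python
import re

# Patterns inferred from the example job names, not from Segway's source.
TRAIN_RE = re.compile(
    r"^(?P<task>emt)(?P<instance>\d+)\.(?P<round>\d+)\.(?P<window>\d+)"
    r"\.(?P<workdir>.+)\.(?P<uuid>[0-9a-f]{32})$"
)
RUN_RE = re.compile(
    r"^(?P<task>vit|jt)(?P<window>\d+)"
    r"\.(?P<workdir>.+)\.(?P<uuid>[0-9a-f]{32})$"
)


def parse_job_name(name):
    """Split a job name into its task, indices, directory name, and run UUID."""
    match = TRAIN_RE.match(name) or RUN_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized job name: {name!r}")
    fields = match.groupdict()
    # Convert the numeric parts to integers where they are present.
    for key in ("instance", "round", "window"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


print(parse_job_name("emt0.1.34.traindir.ed03201cea2047399d4cbcc4b62f9827"))
print(parse_job_name("vit34.identifydir.4f32630d53724f08b34a8fc58793307d"))
```

Grouping parsed names by the uuid field gives you all jobs that belong to one Segway run.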

Tracks

Tracks are named according to their name in the Genomedata archive. For GMTK internal use, periods are converted to underscores.
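The conversion described above amounts to a simple string replacement, sketched here. The function name and the example track name are hypothetical, and Segway may apply additional rules that are not described in this section.

```python
def gmtk_track_name(track_name):
    """Convert periods to underscores, as described for GMTK internal use."""
    return track_name.replace(".", "_")


# Hypothetical track name used only for illustration.
print(gmtk_track_name("H3K27me3.rep1"))
```

Note that two track names differing only in a period versus an underscore would map to the same internal name under this rule.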