Tutorial v1.0.0¶
The aim of this tutorial is to give a basic introduction to the main workflow of
scousepy
. Each stage will be described below along with some of the
important customisable keywords. All data for the tutorial can be found here, along with some example
scripts.
Data¶
This tutorial utilises observations of N2H+ (1-0) towards the Infrared Dark Cloud G035.39-00.33. This data set was first published in Henshaw et al. 2013.. These observations were carried out with the IRAM 30m Telescope. IRAM is supported by INSU/CNRS (France), MPG (Germany) and IGN (Spain). The data file is in fits format and is called
n2h+10_37.fits
Getting Started¶
scousepy
requires several key parameters that are worth setting as global
parameters. These can be set in the following way
filename = 'n2h+10_37'
datadirectory = './'
wsaa = [8.0]
outputdir = './myoutputdirectory/' # optional
ppv_vol = [32.0,42.0,None,None,None,None] # optional
mask_below = 0.3 # optional
Here filename
is provided without the .fits extension. datadirectory
points to the location of the data. By default scousepy
will create a new
directory in the same location as the input file. However, this can be changed
with the optional keyword outputdir
. wsaa
is used during stage 1. It
defines the maximum width of the region used to generate spatially averaged
spectra (see below).
The optional keywords listed above include ppv_vol
, which allows the user to
control the region of the datacube over which to perform the fitting, and
mask_below
which is used to mask the integrated intensity map produced
during stage 1. ppv_vol
is provided as a list and given in the format
ppv_vol = [min_vel, max_vel, min_y, max_y, min_x, max_x]
.
These are the ranges over which the fitting will be performed. This tutorial
is tuned to the n2h+ data, which has hyperfine structure. Here we implement the
range 32-42 km/s since this allows us to focus on the isolated hyperfine
component, which we model as a Gaussian.
Stage 1: defining the coverage¶
As discussed in the description of the code, the purpose of stage 1 is to identify the spatial area over which to fit the data. Stage 1 requires several key parameters. To run stage 1 you can use
s = scouse.stage_1(filename, datadirectory, wsaa,
ppv_vol=ppv_vol,
outputdir=outputdir,
mask_below=mask_below,
fittype='gaussian',
verbose = True,
write_moments=True,
save_fig=True)
Here I’ve used the keywords write_moments=True
and save_fig=True
. The
former will produce FITS files of the zeroth, first, and second order moments of
the data. It will also create a FITS file containing the spectral coordinate of
the maximum value of the spectrum. The latter keyword creates a simple plot of
the coverage that will be used during the fitting procedure. The terminal output
should look something like this…
Where it tells us that we will have to fit a total of 12 spatially averaged spectra and that the total number of spectra to fit is 126. The output coverage map for this particular tutorial is not much to look at, but here it is anyway…
Below is a slightly more interesting example where the refine_grid
keyword
has been set. This outputs spectral averaging areas of different sizes. The
basic idea is to reduce the size of the SAAs based on the spectral complexity
in a given location. In this case we define spectral complexity by making a
comparison between the moment 1 map (intensity-weighted average velocity) and
the velocity at peak emission map. In theory, if the spectrum is simple, e.g.
a single Gaussian, the difference between these two values should be close to 0.
On the other hand, if the spectrum is complex then the difference between these
two values is likely to be > 0. Below is the coverage defined for the central
molecular zone cloud G0.253+0.016, and presented in Henshaw et al. 2019. The
SAA sizes decrease from blue - red - yellow, with the yellow regions exhibiting
the most complex line-profiles.
Stage 2: fitting the spectral averaging areas¶
Stage 2 is where we will perform our manual fitting. It is simple to run using
s = scouse.stage_2(s, verbose=True, write_ascii=True)
where the keyword write_ascii
has been set to output the best-fitting
solutions as an ascii file at the end of the fitting procedure. The fitting
process is based on the interactive process of pyspeckit. Initialising the fitter will look
a bit like this..
where we will have an indication of how many spectra we have to fit (and how
many we have already fitted), as well as some important info for the pyspeckit
interactive fitter. Upon running stage 2, a window should have popped up where
one of the spatially averaged spectra will be displayed. Interactive fitting
can be performed using several commands. To indicate components you would like
to fit select each component twice, once somewhere close to the peak emission
and another click to indicate (roughly) the full-width at half-maximum. In my
experience with this, you don’t need to be particularly accurate, pyspeckit
does an excellent job of picking up the components you have selected. Selection
can be made either using the keyboard (m
) or mouse. Once selected this will
look something like this…
If you are happy with your fit, hitting d
will lock it in. The resulting
fit will be plotted. At this point scousepy
will output some useful information
to the terminal…
and will ask if you’re happy with the fit. If the fit looks good, press enter to continue. This will lock the fit in and overplot the individual components…
scousepy
will then move onto the next spectrum. If you’re not happy with the
fit you can always re-enter the interactive fitter by typing f
. Repeat this
process until the fitting is completed. Note that as a bit of a time-saver
scousepy
will automatically try and fit the each spectrum with the previous
spectrum’s best-fitting solution. Often this leads to good fits and its possible
to simply press enter to cycle through a good chunk of the fits. However, this
can also obviously lead to some terrible fits, particularly if there is a big
jump in spatial location and therefore the spectrum changes considerably. If
this happens simply type f
to re-enter the interactive fitter.
For large datasets its worth noting that there are a couple of keywords here
that might be useful, particularly bitesize
. This enables the user to
perform bitesize fitting where the process is broken down into sessions and the
user fits a fixed number of spectra at any one time. The number of spectra to
fit in any one session can be controlled using the nspec
keyword.
Stage 3: automated fitting¶
Stage 3 represents the automated decomposition stage. scousepy
will take you
best-fitting solutions from stage 2 and pass these to the individual spectra
located within each SAA. The fitting process is controlled by a number of
tolerance levels which are passed to scousepy
in the form of a list or an
array
njobs = 4
tol = [3.0, 2.0, 3.0, 3.0, 0.5]
s = scouse.stage_3(s, tol, njobs=njobs, verbose=True)
The tolerance levels are descibed more completely in Henshaw et al. 2016, however, in short, the first tolerance here is the S/N ratio each component must satisfy, the second is the minimum width of each component given as a multiple of the channel spacing, the third and fourth control how similar (in terms of velocity and width) a component must be to the closest matching component in the SAA fit, and the final tolerance governs the minimum separation between two components for them to be considered distinguishable (it is given as a multiple of the width of the narrowest component).
Something else to consider is the njobs
keyword. The automated fitting procedure
is parallelised. Currently the parallelisation works by sending the fitting of
each SAA to a different node (rather than individual spectra). Perhaps there is
a better way to do this (suggestions welcome),
but it does lead to a bump in speed. I would say however, that in general
I’ve noticed a considerable reduction in speed in moving from IDL to Python.
This is particularly true in the case of complex data. For the G0.253+0.016 data
in Henshaw+ 2019, for example, 4, 5, 6 component fits are not uncommon. When
the number of components is high, each attempt may take ~ a second. If the fit
is unsatisfactory scousepy
will try a new fit with a different number of
components. When combined with the fact that SAAs are overlapped such that many
of the spectra are fit more than once, it is easy to see how the time begins
to ramp up. As a rule of thumb I’d advise against running this stage on a laptop
in the case of complex data. For the G0.253+0.016 data, I set this running on
~ 20 cores and it crunched away over night. In contrast, running stage 3 on a
similarly sized dataset, but one which is dominated by 1-2 components on roughly
the same number of cores completed in ~ half an hour. So it really depends on
the dataset and the time taken for completion will vary. On a positive note you
can just set it going and leave it to do its thing.
Here is an example of the output to terminal for the tutorial data. In this
case the fitting was not parallelised (njobs=1
)…
Stage 4: selecting the best-fitting solutions¶
Not much to say here - just set this stage running. Here the best-fitting solutions from stage 3 are compiled for each spectrum. Where the SAAs overlap there will be multiple fits for each spectrum. Duplicates are removed and the “best-fit” model solution is defined as that with the smallest AIC value.
s = scouse.stage_4(s, verbose=True)
Stage 5: quality control¶
Now we want to check our work. Stage 5 works interactively and is run in the following way
s = scouse.stage_5(s, blocksize = 6,
figsize = [18,10],
plot_residuals=True,
verbose=True)
First, a bunch of diagnostic plots will be created (these are all saved to the stage 5 directory). These include plots of the rms, residuals, reduced chi-squared, number of components, aic, and the chi-squared. An interactive pop up window will be displayed which will look something like this…
This is displaying the rms map. The displayed image can be changed by selecting a different number. For example, selecting 2 will display the reduced chi-squared…
By selecting somewhere in the map a new pop up will be displayed. The number of
spectra that will be shown can be defined with the blocksize
keyword. The
pop up window will look something like this…
The idea behind this stage is to select spectra you may want to take a closer look at, either because you want to see if an alternative fit is available or because the fit is bad and you want to re-fit manually. Clicking on a spectrum will save it for inspection in stage 6. In the below image, I have selected some of the spectra close to the edge of the map. Clearly here the noise is greater and the fits aren’t as good…
Note that if you are not happy with any of the spectra in a region you can
select all of them by pressing a
. If you make a mistake, or would simply like
to remove a spectrum from the selection, hover over the spectrum with the cursor
and press r
or d
. Pressing enter on the pop up window with the spectra will
close it and allow you to select another region. A shaded region will show you
where you have already checked…
For a small dataset like that in the tutorial it is easy to check all spectra. However, you may want to stick to a representative sample for much bigger datasets.
For my own personal work flow I tend to have a very quick run through stage 5 without actually selecting any spectra. The idea is to have a quick glance at whether or not the fits are any good. If not, or if I feel I will have to do a lot of re-fitting (maybe >5-10% of the data), I will stop at this stage and tweak the parameters of stage 3 and re-run from there. Once I’m happy that the majority of the fits are reasonable I will go through stage 5 in earnest.
Stage 6: refitting¶
Here we are going to look again at the spectra we selected during stage 5. Stage 6 is run using
s = scouse.stage_6(s, plot_neighbours=True,
radius_pix = 2,
figsize = [18,10],
plot_residuals=True,
write_ascii=True,
verbose=True)
The plot_neighbours
keyword here is optional. If set to True stage 6 will
begin by reminding you of each spectrum and its surroundings. This isn’t
particularly necessary for the tutorial data, but for large datasets, where it
easy to forget perhaps why a spectrum was selected for reevaluation, it can be
useful. If set to True, the first pop up window will look something like this…
Here, the central pixel is the first pixel we selected during stage 5 (see above).
It is at the edge of the map (hence the pixels with X values). Pressing Enter
will move scousepy
onto the next phase. At the next phase you will be
presented with a choice. scousepy
will show the current fit and any other
alternative solutions to this spectrum. In the example below no alternatives are
available. However, scousepy
will always provide a no-fit option. In my
personal experience I have found that sometimes having no-fit is better than
having a bad fit.
At this point you can make a choice. Either select the current spectrum, select an alternative, or press Enter to enter the interactive fitter you will now be familiar with from stage 2. Let’s say we aren’t happy with this fit but we also don’t think we can do a better job - we can click on the no-fit option like so…
This will select the no-fit option and move on to the next spectrum. Moving on,
we see one we want to re-fit. Pressing Enter will enter the interactive fitter
from pyspeckit
. Using the same commands as in stage 2 we can try to fit
two components like so…
and as before, scousepy
will give you a running commentary on what is
happening in the terminal…
Repeat this process until all of the spectra selected in stage 5 have been checked over. As with stage 2, stages 5 and 6 can be run in bitesize mode. This type of fitting is a little more fiddly and will be described below.
Complete Example¶
In reality you might not want to run all of the stages back to back. To save
memory each output file produced by scousepy
only contains essential
information. An example code to run scousepy
may therefore look something
like this…
from scousepy import scouse
from astropy.io import fits
import os
pl.ion()
def run_scousepy():
datadirectory = '../data/'
outputdir = '../output/simple_example_run/'
filename = 'n2h+10_37'
ppv_vol = [32.0,42.0,None,None,None,None]
wsaa = [8.0]
tol = [3.0, 1.0, 3.0, 3.0, 0.5]
verb = True
fittype = 'gaussian'
njobs = 4
mask = 0.3
#==========================================================================#
# Stage 1
#==========================================================================#
if os.path.exists(outputdir+filename+'/stage_1/s1.scousepy'):
s = scouse(outputdir=outputdir, filename=filename, fittype=fittype,
datadirectory=datadirectory)
s.load_stage_1(outputdir+filename+'/stage_1/s1.scousepy')
s.load_cube(fitsfile=datadirectory+filename+".fits")
else:
s = scouse.stage_1(filename, datadirectory, wsaa,
ppv_vol=ppv_vol,
outputdir=outputdir,
mask_below=mask,
fittype=fittype,
verbose = verb,
write_moments=True,
save_fig=True)
#==========================================================================#
# Stage 2
#==========================================================================#
if os.path.exists(outputdir+filename+'/stage_2/s2.scousepy'):
s.load_stage_2(outputdir+filename+'/stage_2/s2.scousepy')
else:
s = scouse.stage_2(s, verbose=verb, write_ascii=True)
#==========================================================================#
# Stage 3
#==========================================================================#
if os.path.exists(outputdir+filename+'/stage_3/s3.scousepy'):
s.load_stage_3(outputdir+filename+'/stage_3/s3.scousepy')
else:
s = scouse.stage_3(s, tol, njobs=njobs, verbose=verb)
#==========================================================================#
# Stage 4
#==========================================================================#
if os.path.exists(outputdir+filename+'/stage_4/s4.scousepy'):
s.load_stage_4(outputdir+filename+'/stage_4/s4.scousepy')
else:
s = scouse.stage_4(s, verbose=verb)
#==========================================================================#
# Stage 5
#==========================================================================#
if os.path.exists(outputdir+filename+'/stage_5/s5.scousepy'):
s.load_stage_5(outputdir+filename+'/stage_5/s5.scousepy')
else:
s = scouse.stage_5(s, blocksize = 6,
figsize = [18,10],
plot_residuals=True,
verbose=verb)
#==========================================================================#
# Stage 6
#==========================================================================#
if os.path.exists(outputdir+filename+'/stage_6/s6.scousepy'):
s.load_stage_6(outputdir+filename+'/stage_6/s6.scousepy')
else:
s = scouse.stage_6(s, plot_neighbours=True,
radius_pix = 2,
figsize = [18,10],
plot_residuals=True,
write_ascii=True,
verbose=verb)
return s
s = run_scousepy()
This format was largely introduced to prevent people from overwriting their hard work. In this way each stage can be run independently, without having to run the process from start to finish.
Tips and Tricks¶
One thing that I have found particularly useful in my own work is to break the
fitting process up into chunks. This can be really helpful if you have a lot of
spectra to fit. I have included a bitesized fitting process into scousepy
which can be run in stage 2 in the following way…
if os.path.exists(datadirectory+filename+'/stage_2/s2.scousepy'):
s.load_stage_2(datadirectory+filename+'/stage_2/s2.scousepy')
else:
s = scouse.stage_2(s, verbose=verb, write_ascii=True, bitesize=True, nspec=10)
s = scouse.stage_2(s, verbose=verb, write_ascii=True, bitesize=True, nspec=100)
Check out the Complete Example in the tutorial
section of the documentation to understand what is going on here with some more
context. However, in short, the first run I have used the keywords bitesize=True
and nspec=10
. This will fit the first 10 spectra as normal. Note the
indentation on the second call to stage 2. After the first run with 10 spectra
each subsequent call to the code will load the s2.scousepy
file and then
100 spectra will be fitted until the process is complete. Of course, you can
change this to whatever value you like.