Lab 7/8 (Hera)

Overview

The purpose of this lab is to finalize your analysis, ultimately reporting frequencies that correspond to statistically significant brightness. This lab will be conducted over two weeks and is intended to be more open-ended, as an opportunity for you to practice your skills.

As a reminder, the scientific task we are trying to solve with the Hera data is to identify radio frequencies that are emitted by distant galaxies. We can exploit several features of this type of signal, namely that we expect it to be constant over short periods of time and constant (in magnitude) across antenna pairs. Your final product should be a list of frequencies that correspond to a statistically significant constant brightness, with appropriate justification. The following ideas are helpful.

Background Sources

Beyond our 'signal' (described above), there are many possible backgrounds that we might pick up, including other astronomical objects, near-Earth objects, human TV and radio signals, and so on. To make progress, we need to make some simplifying assumptions. Most importantly, we assume that there are no 'impostor' backgrounds, i.e., background sources that have identical characteristics to our signal as we have described it (constant across antennas and time). Beyond that, we will split our data into three components: signal, thermal noise, and RFI (radio frequency interference).

RFI Contamination

RFI is hard to model, as it comprises many different sources (see this paper for more information, as well as useful explanations of many of the concepts in this lab). Our approach in this lab is to filter out frequencies with large RFI. You can continue to do so in the manner of Lab 6 or make adjustments as you see fit. One option is to additionally consult Radio Spectrum - 30 MHz to 144 MHz (jneuhaus.com) and IEEE_REACH_South_Africa_Spectrum_Allocation_chart_2-9.pdf, which list frequencies that are in common use in South Africa.

The bonus problem for this lab is to implement a filter based on the mean-subtracted incoherent noise spectra, as described in section 2 of the paper linked previously. In essence, the goal is to identify differences in visibility (averaged across many baselines) that have a large significance compared to the average thermal noise, which approaches a normal distribution as a result of the baseline averaging.
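As a starting point for the bonus problem, here is a rough sketch of the idea in the paragraph above, not the paper's exact procedure. It assumes the visibilities are stored in a complex array of shape (baselines, times, frequencies) and that differences are taken between adjacent time samples; the function name, array layout, injected burst, and the threshold of 5 are all illustrative assumptions. It uses the fact that a Rayleigh distribution has a standard deviation to mean ratio of $\sqrt{4/\pi - 1}$, so averaging over $N_{baselines}$ baselines shrinks that ratio by $\sqrt{N_{baselines}}$.

```python
import numpy as np

def incoherent_noise_zscores(vis):
    """Z-scores of the baseline-averaged |time-differenced visibility|.

    vis: complex array with assumed shape (n_baselines, n_times, n_freqs).
    Returns an array of shape (n_times - 1, n_freqs).
    """
    n_bl = vis.shape[0]
    # Differencing adjacent times removes anything constant in time.
    amp = np.abs(np.diff(vis, axis=1))
    # Incoherent average: average amplitudes (not complex values) over baselines.
    spectrum = amp.mean(axis=0)
    # Robust per-frequency estimate of the thermal level (median over time).
    mu_hat = np.median(spectrum, axis=0)
    # Rayleigh std/mean ratio, reduced by averaging over n_bl baselines.
    rel_std = np.sqrt((4.0 / np.pi - 1.0) / n_bl)
    return (spectrum - mu_hat) / (mu_hat * rel_std)

# Simulated example: pure thermal noise plus one hypothetical RFI burst.
rng = np.random.default_rng(0)
shape = (100, 40, 8)
vis = rng.normal(0.0, 1.0, shape) + 1j * rng.normal(0.0, 1.0, shape)
vis[:, 10, 3] += 5.0  # burst at time index 10, channel 3, on every baseline

z = incoherent_noise_zscores(vis)
flags = np.abs(z) > 5.0  # illustrative threshold
print(np.argwhere(flags))
```

A burst at a single time sample shows up in the two differences that involve that sample, which is why both neighbouring rows of the difference spectrum get flagged.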

Thermal Noise

Thermal noise is the background that we will actually model and use as the basis of our null hypothesis test. The thermal noise is known to follow a complex Gaussian distribution: the thermal background is $b = x + iy$, where $x$ and $y$ are independent, identically distributed, and follow a normal distribution $N(0, \sigma)$ (i.e., they have mean 0 and standard deviation $\sigma$). To estimate the background, we can consider the difference in visibilities $\Delta V_{jk} = V_j - V_k$. Since the signal is constant, it cancels in the difference, leaving $\Delta V_{jk} = b_j - b_k = (x_j - x_k) + i(y_j - y_k)$. Since $x$ and $y$ are normally distributed, $x_j - x_k$ and $y_j - y_k$ are also normally distributed, but as $N(0, \sqrt{2}\sigma)$, provided the noise in the two samples is independent. Then $|\Delta V_{jk}|$ follows a Rayleigh distribution with parameter $\sigma_{\Delta} = \sqrt{2}\sigma$.
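The argument above can be checked numerically. The following is a minimal simulation sketch, assuming independent noise in the two samples; the noise scale, the constant signal value, and the helper name `complex_noise` are arbitrary choices for illustration. It uses the standard maximum-likelihood estimate of a Rayleigh parameter, $\hat{\sigma}^2 = \frac{1}{2N}\sum r_i^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5          # assumed per-component noise scale (arbitrary units)
n_samples = 200_000

def complex_noise(size):
    """Complex Gaussian noise with independent N(0, sigma) real and imaginary parts."""
    return rng.normal(0.0, sigma, size) + 1j * rng.normal(0.0, sigma, size)

signal = 3.0 + 0.5j  # hypothetical constant signal, identical in both samples
v_j = signal + complex_noise(n_samples)
v_k = signal + complex_noise(n_samples)

# The constant signal cancels, leaving only the difference of the noise terms.
delta_v = v_j - v_k

# Maximum-likelihood estimate of the Rayleigh parameter of |delta_v|.
sigma_delta_hat = np.sqrt(np.mean(np.abs(delta_v) ** 2) / 2.0)

print("estimated sigma_delta:", sigma_delta_hat)
print("expected sqrt(2)*sigma:", np.sqrt(2.0) * sigma)
```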

Finally, as a brief foreshadowing of the significance testing we will do later, note that our null hypothesis is that $s = 0$, such that $V = s + b = b$. In this case, $|V|$ follows a Rayleigh distribution with parameter $\sigma = \sigma_{\Delta} / \sqrt{2}$, which has mean $E[|V|] = \sqrt{\frac{\pi}{2}} \sigma = \frac{\sqrt{\pi}}{2}\sigma_{\Delta}$ and variance $E[|V|^2] - E[|V|]^2 = \frac{4 - \pi}{2}\sigma^2 = \frac{4 - \pi}{4}\sigma_{\Delta}^2$.
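For convenience, the conversion from a fitted $\sigma_{\Delta}$ to the null-hypothesis mean and variance of $|V|$ can be wrapped in a small helper. This is a sketch only: the function name and the example value of $\sigma_{\Delta}$ are hypothetical. It goes through $\sigma = \sigma_{\Delta}/\sqrt{2}$ and cross-checks the result against `scipy.stats.rayleigh`.

```python
import numpy as np
from scipy import stats

def rayleigh_mean_var(sigma_delta):
    """Mean and variance of |V| under the null, given the Rayleigh parameter of |Delta V|."""
    sigma = sigma_delta / np.sqrt(2.0)          # per-component noise scale
    mean = np.sqrt(np.pi / 2.0) * sigma         # Rayleigh mean
    var = (4.0 - np.pi) / 2.0 * sigma ** 2      # Rayleigh variance
    return mean, var

sigma_delta = 2.0                               # hypothetical fitted value
mean_b, var_b = rayleigh_mean_var(sigma_delta)

# Cross-check against SciPy's Rayleigh distribution with scale sigma.
ref = stats.rayleigh(scale=sigma_delta / np.sqrt(2.0))
print("mean:", mean_b, "scipy:", ref.mean())
print("variance:", var_b, "scipy:", ref.var())
```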

Test Statistic

The simplest quantity we could consider is $|V|$, the magnitude of the visibility. However, for each frequency, there are multiple values of $|V|$ corresponding to different baselines and times. Since our signal is constant, averaging over these other axes should increase our sensitivity. In that case, we can invoke the central limit theorem (CLT). Suppose our background distribution for $|V|$ has mean $\mu_{b}$ and standard deviation $\sigma_{b}$ (not to be confused with $\sigma$, which is the standard deviation of the Gaussian distribution of the real and imaginary parts of the thermal noise). We can calculate these quantities from the formulae for a Rayleigh distribution given in the thermal noise section and our fit to $|\Delta V|$. If we average $N = N_{baselines} \times N_{times}$ independent samples from this distribution, then the CLT tells us that the distribution of the average will approach $N(\mu_b, \sigma_b / \sqrt{N})$.

Using the information above, we can therefore move from our background fit of $|\Delta V|$ to a background distribution for our test statistic at each frequency, where the test statistic is the average of the magnitude of the visibilities over all baselines and times. Note that you may need a different background distribution for different frequencies/baselines/times, depending on what you find out when you investigate the data. For simplicity, there is always the option to only consider a range of frequencies/baselines/times, or to split up your analysis depending on the range.
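To make the test statistic concrete, here is a minimal sketch of turning an averaged $|V|$ into a significance under the CLT background. The function names, the simulated array shape, and the injected signal amplitude are assumptions for illustration, and the test is one-sided because we are looking for excess brightness.

```python
import numpy as np
from scipy import stats

def null_distribution(sigma_delta, n_avg):
    """Mean and standard deviation of the averaged |V| under the null hypothesis."""
    sigma = sigma_delta / np.sqrt(2.0)
    mu_b = np.sqrt(np.pi / 2.0) * sigma
    sigma_b = np.sqrt((4.0 - np.pi) / 2.0) * sigma
    return mu_b, sigma_b / np.sqrt(n_avg)

def significance(test_stat, sigma_delta, n_avg):
    """Z-score and one-sided p-value of an averaged |V| against the null."""
    mu, std_err = null_distribution(sigma_delta, n_avg)
    z = (test_stat - mu) / std_err
    return z, stats.norm.sf(z)

# Simulated single-frequency data: 50 baselines x 60 times, sigma = 1.
rng = np.random.default_rng(0)
n_bl, n_t, sigma = 50, 60, 1.0
noise = rng.normal(0.0, sigma, (n_bl, n_t)) + 1j * rng.normal(0.0, sigma, (n_bl, n_t))
sigma_delta = np.sqrt(2.0) * sigma   # would come from the |Delta V| fit in practice

z_noise, p_noise = significance(np.abs(noise).mean(), sigma_delta, n_bl * n_t)
z_signal, p_signal = significance(np.abs(noise + 1.0).mean(), sigma_delta, n_bl * n_t)
print("noise only:  z =", z_noise, "p =", p_noise)
print("with signal: z =", z_signal, "p =", p_signal)
```

Because the standard error shrinks as $1/\sqrt{N}$, a signal comparable to the noise in a single sample becomes highly significant after averaging a few thousand samples.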

Recommended Steps

1. Eliminate RFI contamination as you have done in previous labs.
2. Investigate $|\Delta V|$ and characterize the thermal noise. In particular, $|\Delta V|$ is a function of frequency, time, and baseline pair. Make histograms, waterfall plots, and line plots of $|\Delta V|$ and $\langle |\Delta V| \rangle$ (the brackets indicate an average over one or more of the three axes [frequency, time, baseline]). If you restrict to a particular frequency, time, and baseline pair, then according to theory it should follow a Rayleigh distribution. As you include data from more points on these axes, you can check whether it still looks like a Rayleigh distribution. If it doesn't, then the background likely depends on that axis in some way, and to make your job easier, you should apply filters along these axes such that your background is relatively constant. Or, if you want to keep more data, determine how to adjust your background model along these axes (e.g., adjust the background based on a frequency range or an individual frequency). Note: at this point, you could go back and do the bonus question for RFI contamination, if you wanted.
3. Perform one or more statistical fits to find the Rayleigh parameter $\sigma_{\Delta}$ for the background regions you found in step 2. Using the theory discussed above, use this to determine the distribution of the thermal noise for the null hypothesis. Finally, find the significance of your measured test statistic for each frequency according to your null hypothesis. Report the frequencies that are significant and create a waterfall plot for them.
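Steps 2 and 3 can be sketched end to end on simulated data as follows. This is an illustration under stated assumptions rather than a prescribed solution: the array layout (baselines, times, frequencies), the use of adjacent-time differences for $\Delta V$, the frequency grid, the injected signal in channel 5, and the threshold of 5 are all hypothetical. The Rayleigh fit uses `scipy.stats.rayleigh.fit` with the location fixed at zero.

```python
import numpy as np
from scipy import stats

# Simulated visibilities: thermal noise plus a constant signal in one channel.
rng = np.random.default_rng(1)
n_bl, n_t, n_f = 20, 60, 16
shape = (n_bl, n_t, n_f)
vis = rng.normal(0.0, 1.0, shape) + 1j * rng.normal(0.0, 1.0, shape)
vis[:, :, 5] += 1.0                          # hypothetical constant signal
freqs_mhz = np.linspace(100.0, 115.0, n_f)   # hypothetical frequency grid

# Step 2: |Delta V| from adjacent time samples (constant signal cancels).
delta_amp = np.abs(np.diff(vis, axis=1))

# Step 3: fit the Rayleigh parameter sigma_delta separately for each frequency.
sigma_delta = np.array([
    stats.rayleigh.fit(delta_amp[:, :, f].ravel(), floc=0)[1]
    for f in range(n_f)
])

# Null distribution of the averaged |V| from the Rayleigh moments and the CLT.
sigma_hat = sigma_delta / np.sqrt(2.0)
mu_b = np.sqrt(np.pi / 2.0) * sigma_hat
sigma_b = np.sqrt((4.0 - np.pi) / 2.0) * sigma_hat
n_avg = n_bl * n_t

# Test statistic: |V| averaged over baselines and times, per frequency.
test_stat = np.abs(vis).mean(axis=(0, 1))
z = (test_stat - mu_b) / (sigma_b / np.sqrt(n_avg))
p_values = stats.norm.sf(z)                  # one-sided: excess brightness only

significant = freqs_mhz[z > 5.0]             # illustrative threshold
print("significant frequencies (MHz):", significant)
```

Note that this sketch does not propagate the uncertainty of the fitted $\sigma_{\Delta}$ into the z-scores, so the noise-only z-scores are somewhat broader than a unit normal; a real analysis should account for that when choosing a threshold.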