Lab 7/8 (LHC)

Overview

This is the final lab for the LHC data. The goal is to apply an event selection to the pseudo-data (data_highLumi_pt_250_500.h5, data_lowLumi_pt_250_500.h5, data_highLumi_pt_1000_1200.h5, data_lowLumi_pt_1000_1200.h5) and report the significance and 95% confidence level upper limits on the signal yield. Along the way, you should accomplish the following:

1. Develop an improved event selection (compared to lab 5/6). You should include at least 3 variables. You can use rectangular cuts as before or define an arbitrary decision surface as you choose. As an approximation for the median significance, you should use $Z_0 \approx \sqrt{2\left((s + b)\ln(1 + s/b) - s\right)}$ (see eq. 97 of https://arxiv.org/pdf/1007.1727.pdf for reference).


2. Comment on the new approximation vs. the one we used in lab 5/6: $Z_0 \approx \frac{s}{\sqrt{b}}$. For what values of s and b are the two similar/different? The new approximation is more accurate. Is there a region in which the two approximations differ that you are more likely to reach in this lab? Can you imagine other situations/reasons why even our new approximation might fail? (Hint: how do we know s and b? Is there uncertainty?)


3. Plot the approximate median expected significance of your optimized event selection as a function of the total yield (i.e., keeping n_higgs / n_qcd fixed but varying n_higgs + n_qcd). Note the point at which your expected significance drops below 5 sigma. Finally, make a 2D plot of n_higgs + n_qcd vs. n_higgs, with the color at each point indicating the expected significance.

4. Verify the plots above by running computational experiments with Poisson statistics and calculating the exact significance using scipy.stats. In other words, create a Poisson distribution with mean = n_higgs, another Poisson distribution with mean = n_qcd, and run many trials. For each trial, determine the number of QCD and Higgs events that would pass your selection (assuming the efficiencies that you calculated from your cuts are exact). Sum them to get the total number of events, then calculate the significance of that number of events under your null hypothesis. Find the median significance over your trials. Repeat this for many values of n_higgs and n_qcd to re-create the plots in step 3. Finally, plot a histogram of the significance for the case where n_higgs and n_qcd correspond to the expected yields, and indicate the median according to your computational experiments and according to the approximation from step 1.


5. For the high luminosity and low luminosity pseudo-data given above, plot the observed data overlaid with the expected signal and background, normalized to the observed yield, both with and without your event selection.


6. Calculate the observed significance. If it is less than 5 sigma, calculate the 95% confidence level upper limit on the signal yield. Compare to your expected significance curves calculated in step 3.
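For steps 1 and 2, the two approximations can be compared numerically. The following is a minimal sketch, not a prescribed solution: the function names `z_asimov` and `z_simple` and the (s, b) pairs are made up for illustration and are not the yields of this lab.

```python
import numpy as np

def z_asimov(s, b):
    """Z_0 ~ sqrt(2((s + b) ln(1 + s/b) - s)), eq. 97 of arXiv:1007.1727."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(2.0 * ((s + b) * np.log1p(s / b) - s))

def z_simple(s, b):
    """Lab 5/6 approximation Z_0 ~ s / sqrt(b)."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    return s / np.sqrt(b)

# Illustrative (s, b) pairs only; substitute the yields from your own selection.
for s, b in [(10.0, 10000.0), (50.0, 100.0), (20.0, 5.0)]:
    print(f"s={s:g}, b={b:g}: "
          f"new={float(z_asimov(s, b)):.3f}, s/sqrt(b)={float(z_simple(s, b)):.3f}")
```

Printing both values side by side for a range of s/b is one way to see where the two expressions agree and where they start to separate.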
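For step 3, one possible way to build the significance scan is sketched below. It assumes that n_higgs and n_qcd denote the expected yields after the selection; the values `n_higgs_sel = 60` and `n_qcd_sel = 100` and the grid ranges are hypothetical placeholders, not numbers from this lab.

```python
import numpy as np

def z_asimov(s, b):
    """Z_0 ~ sqrt(2((s + b) ln(1 + s/b) - s))."""
    return np.sqrt(2.0 * ((s + b) * np.log1p(s / b) - s))

# Hypothetical post-selection yields; replace with your own.
n_higgs_sel = 60.0
n_qcd_sel = 100.0
ratio = n_higgs_sel / n_qcd_sel  # kept fixed while the total yield varies

# 1D scan: vary n_higgs + n_qcd at fixed n_higgs / n_qcd.
totals = np.linspace(1.0, 2.0 * (n_higgs_sel + n_qcd_sel), 400)
s_scan = totals * ratio / (1.0 + ratio)
b_scan = totals / (1.0 + ratio)
z_scan = z_asimov(s_scan, b_scan)

# Largest scanned total yield whose expected significance is still below 5 sigma.
below = totals[z_scan < 5.0]
threshold_total = below.max() if below.size else None
print("expected significance drops below 5 sigma near total yield:", threshold_total)

# 2D scan: total yield vs n_higgs, with n_qcd = total - n_higgs.
n_tot, n_h = np.meshgrid(np.linspace(10.0, 300.0, 59),
                         np.linspace(1.0, 100.0, 50), indexing="ij")
n_q = n_tot - n_h
valid = n_q > 0  # points with n_higgs >= total are unphysical and left as NaN
z_grid = np.full(n_tot.shape, np.nan)
z_grid[valid] = z_asimov(n_h[valid], n_q[valid])
```

The arrays `totals`/`z_scan` and `n_tot`/`n_h`/`z_grid` can then be passed to a plotting routine, for example matplotlib's `plot` and `pcolormesh`.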
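The toy experiments of step 4 could be organized roughly as follows. This is a sketch under stated assumptions: the selection is modelled as a binomial draw with the given efficiencies, the null hypothesis is background only with mean eff_qcd * n_qcd, and the yields and efficiencies (`n_higgs`, `n_qcd`, `eff_higgs`, `eff_qcd`) are hypothetical placeholders. The helper name `toy_significances` is also made up for this example.

```python
import numpy as np
from scipy import stats

def toy_significances(n_higgs, n_qcd, eff_higgs, eff_qcd, n_trials=20000, seed=0):
    """Significance (in sigma) of each Poisson toy against the background-only hypothesis."""
    rng = np.random.default_rng(seed)
    # Number of produced events in each trial.
    higgs_produced = rng.poisson(n_higgs, size=n_trials)
    qcd_produced = rng.poisson(n_qcd, size=n_trials)
    # Events passing the selection, treating the efficiencies as exact.
    higgs_pass = rng.binomial(higgs_produced, eff_higgs)
    qcd_pass = rng.binomial(qcd_produced, eff_qcd)
    n_obs = higgs_pass + qcd_pass
    # p-value P(N >= n_obs | b); poisson.sf(k, mu) is P(N > k).
    b_expected = eff_qcd * n_qcd
    p_values = stats.poisson.sf(n_obs - 1, b_expected)
    # Convert the one-sided p-value to a Gaussian significance.
    return stats.norm.isf(p_values)

# Hypothetical yields and efficiencies; replace with your own.
n_higgs, n_qcd = 100.0, 20000.0
eff_higgs, eff_qcd = 0.6, 0.005

z_toys = toy_significances(n_higgs, n_qcd, eff_higgs, eff_qcd)
z_median = np.median(z_toys)

s, b = eff_higgs * n_higgs, eff_qcd * n_qcd
z_approx = np.sqrt(2.0 * ((s + b) * np.log1p(s / b) - s))
print(f"median from toys: {z_median:.3f}, approximation: {z_approx:.3f}")
```

A histogram of `z_toys` with vertical lines at `z_median` and `z_approx` gives the comparison requested at the end of step 4. For very large signals the p-value can underflow to zero, in which case the significance becomes infinite and needs separate handling.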
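For the plots in step 5, the normalization to the observed yield might be handled as in the sketch below. The helper `normalized_expectation`, the toy arrays, the binning, and the expected yields are all hypothetical stand-ins generated for this example; they do not reflect the contents of the .h5 files.

```python
import numpy as np

def normalized_expectation(sig_values, bkg_values, obs_values, bins, n_sig_exp, n_bkg_exp):
    """Histogram the observed data and scale signal + background to the observed yield."""
    sig_hist, _ = np.histogram(sig_values, bins=bins)
    bkg_hist, _ = np.histogram(bkg_values, bins=bins)
    obs_hist, _ = np.histogram(obs_values, bins=bins)
    # Shape templates scaled to the expected yields.
    sig_exp = sig_hist / sig_hist.sum() * n_sig_exp
    bkg_exp = bkg_hist / bkg_hist.sum() * n_bkg_exp
    # Common factor so that signal + background integrates to the observed count.
    scale = obs_hist.sum() / (sig_exp.sum() + bkg_exp.sum())
    return obs_hist, sig_exp * scale, bkg_exp * scale

# Toy stand-ins for one variable (not real lab data).
rng = np.random.default_rng(1)
sig_values = rng.normal(125.0, 10.0, size=5000)
bkg_values = 50.0 + rng.exponential(80.0, size=5000)
obs_values = np.concatenate([rng.normal(125.0, 10.0, size=40),
                             50.0 + rng.exponential(80.0, size=400)])
bins = np.linspace(50.0, 250.0, 41)

obs_hist, sig_scaled, bkg_scaled = normalized_expectation(
    sig_values, bkg_values, obs_values, bins, n_sig_exp=50.0, n_bkg_exp=500.0)
```

Applying the same function to the arrays before and after the selection mask yields the two versions of the plot; the background can be drawn as a filled histogram, the signal stacked on top, and the observed counts as points with Poisson error bars.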
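For step 6, a simple counting-experiment version of the observed significance and the upper limit is sketched below. It assumes a known background expectation with no uncertainty and uses the classical one-sided Poisson upper limit; the course may intend a different construction, so treat this as one option. The function names and the example counts (`n_obs = 110`, `b = 100`) are hypothetical.

```python
from scipy import stats

def observed_significance(n_obs, b):
    """Gaussian significance of observing >= n_obs events when b are expected."""
    p_value = stats.poisson.sf(n_obs - 1, b)  # P(N >= n_obs | b)
    return stats.norm.isf(p_value)

def signal_upper_limit(n_obs, b, cl=0.95):
    """Classical upper limit on s: the value with P(N <= n_obs | s + b) = 1 - cl."""
    # Uses the Poisson / chi-square relation P(N <= n | mu) = 1 - F_chi2(2 mu; 2(n + 1)).
    mu_up = 0.5 * stats.chi2.ppf(cl, 2 * (n_obs + 1))
    return max(mu_up - b, 0.0)  # a negative limit is clipped to zero

# Hypothetical observed count and background expectation after selection.
n_obs, b = 110, 100.0
z_obs = observed_significance(n_obs, b)
s_up = signal_upper_limit(n_obs, b)
print(f"observed significance: {z_obs:.2f} sigma, 95% CL upper limit on s: {s_up:.1f}")
```

As a sanity check on the construction, `signal_upper_limit(0, 0.0)` reproduces the familiar limit of about 3 events for zero observed events and zero background.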

Finally, the bonus problem for this lab is to use supervised machine learning to develop your event selection. If you are interested, you can check out "1. Supervised learning" in the scikit-learn 1.1.3 documentation and discuss with the instructors.
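As a starting point for the bonus problem, the sketch below trains a scikit-learn classifier and turns its output score into a selection. Everything about the data is an assumption: the three features are synthetic Gaussian stand-ins rather than jet variables from the lab files, and the choice of `GradientBoostingClassifier` and the threshold of 0.5 are arbitrary illustrations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in features (3 variables per event); not real lab data.
X_sig = rng.normal(loc=[1.0, 0.5, 0.0], scale=1.0, size=(n, 3))
X_bkg = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(n, 3))
X = np.vstack([X_sig, X_bkg])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = signal, 0 = background

# Hold out half of the events so the efficiencies are measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)

# The classifier score acts as a single discriminating variable to cut on.
scores = clf.predict_proba(X_test)[:, 1]
threshold = 0.5  # arbitrary; in practice scan it to maximize the expected significance
eff_sig = np.mean(scores[y_test == 1] > threshold)
eff_bkg = np.mean(scores[y_test == 0] > threshold)
print(f"signal efficiency: {eff_sig:.3f}, background efficiency: {eff_bkg:.3f}")
```

The resulting efficiencies play the same role as those from rectangular cuts, so they can be combined with the expected yields to evaluate the significance approximation from step 1.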