# Full text of "Kepler Presearch Data Conditioning I - Architecture and Algorithms for Error Correction in Kepler Light Curves"

## See other formats

(N 6 Kepler Presearch Data Conditioning I — Architecture and Algorithms for Error Correction in Kepler Light Curves Martin C. Stumpe*'^'^, Jeffrey C. Smith^'^, Jeffrey E. Van Cleve^'^, Joseph D. Twicken^'^, Thomas S. Barclay^'^, Michael N. Fanelh^'^, Forrest R. Girouard^'^, Jon M. Jenlcins^'^, Jeffery J. Kolodziejczalc^, Sean D. McCauhff^'^, Robert L. Morris^'^ O (N ABSTRACT ^ Kepler provides hght curves of 156,000 stars with unprecedented precision. I^ However, the raw data as they come from the spacecraft contain significant sys- I I tematic and stochastic errors. These errors, which include discontinuities, sys- ^ tematic trends, and outliers, obscure the astrophysical signals in the light curves. To correct these errors is the task of the Presearch Data Conditioning (PDC) module of the Kepler data analysis pipeline. The original version of PDC in Kepler did not meet the extremely high performance requirements for the de- tection of miniscule planet transits or highly accurate analysis of stellar activity j^ and rotation. One particular deficiency was that astrophysical features were of- ten removed as a side-effect to removal of errors. In this paper we introduce ^ the completely new and significantly improved version of PDC which was imple- (N| mented in Kepler SOC 8.0. This new PDC version, which utilizes a Bayesian Z^ approach for removal of systematics, reliably corrects errors in the light curves ^— I while at the same time preserving planet transits and other astrophysically in- cn teresting signals. We describe the architecture and the algorithms of this new ^ PDC module, show typical errors encountered in Kepler data, and illustrate the ^~~' corrections using real light curve examples. > rS Subject headings: Stars; Extrasolar Planets; Data Analysis and Techniques email: niartin.stumpe@nasa.gov iNASA Ames Research Center, Moffett Field, CA, 94035 2 set: Institute ■^Bay Area Environmental Research Institute ^Orbital Sciences Corporation ^Marshall Space Fhght Center, Huntsville, AL 35812 Introduction 1.1. The Kepler Mission The Kepler Mission is the tenth mission in NASA's Discovery Program. Its main objec- tive is to detect Earth-size planets in the habitable zonaH iKasting et al. (1993); Koch et al. (2010)] of Sun-like stars and to determine what fraction rj^ of all stars in the milky way host such planets. Planet detection in Kepler is performed using transit photometry Deeg et al. (1998); Charbonneau et al. (2000)]: A planet transiting between its star and the Kepler photometer blocks a fraction of the stars light proportional to the planet's projected area relative to that of its star, and thus decreases the apparent brightness of the star. For that purpose, the Kepler spacecraft is in an Earth-trailing orbit and is continuously observing the incoming light of about 156,000 stars in a field of view of 115 deg^. The light flux from the stars is recorded using an array of 42 CCDs, arranged as 84 module output channels each of which has 1100x1024 pixels. Each CCD is read out in a 6.54 sec interval, and the flux for each target is integrated for 29.4 minutes, a so-called Long Cadence (LC) intervaP] The large number of targets is required because the probability for correct geometric alignment of a planet's orbit with the star-to-photometer-axis is the ratio of the star's radius and the planet orbit radius. This translates to a transit probability of only about 0.5% of all Earth-like planets in a 1 AU orbit around Sun-like stars. Therefore, a sufficiently large number of stars has to be observed to obtain a reliable estimate of the frequency of Earth-like planets in our galaxy (r/e). In addition to these LC targets, a subselection of 512 targets are sampled at 58.35 sec "Short-Cadence" (SC) intervals, which allows not only for more precise transit timing measurement, but also for detailed asteroseismological investigations of these stars, and for measurements of binary star eclipse timing variations. With the area ratio of the Earth relative to the Sun being ^ le~^, detection of transits of Earth-like planets requires an unprecedented photometric precision of 20 ppm Koch et al. (2010); Jenkins et al. (2010a)]. The search for the transits is further complicated by the presence of systematic errors in the photometric data, which occlude the transit signatures as well as other astrophysical signals in the light flux series. The main causes for the systematic errors are changes in the focus of the photometer (caused by instrinsic and extrinsic thermal ^The habitable zone is the range of orbit radii where temperatures allow for liquid water on the planet's surface. Kaltenegger & Sass (2011)] ^Note that the targets do not have to be stars. Some of Kepler's targets are in fact galaxies. However, this distinction is not relevant for the context of this paper. variations), differential velocity aberration, residual spacecraft pointing errors^ mechanical vibrations, and electrical interference with other devices onboard the spacecraft (typically leading to Moire like patterns). Here we report on the new Presearch Data Conditioning (PDC) module in the Kepler Science Operations Center (SOC) Science Processing Pipeline, which is specifically designed to remove these systematic errors and other artifacts from the light curves prior to planet detection. The new PDC module, which has been completely rewritten for Kepler SOC 8.0 (the official Kepler software release in August 2011), shows dramatically improved performance - in particular at removing systematic trends while preserving stellar variability, and at detecting and correcting flux discontinuities. 1.2. Kepler Science Processing Pipeline Photometric data from the Kepler spacecraft is processed through the Kepler Science Processing Pipeline Jenkins et al. (2010b); Middour et al. (2010)] to extract transit infor mation and export light curves that are readily prepared for asteroseismological analysis. The raw pixel data from the photometer are passed to the Calibration module (CAL) |Quin- tana et al. (2010)], which performs pixel-level calibration for target and background pixels. Digital apertures for each target are calculated by the Target Aperture Definition (TAD) module Bryson et al. (2010)]. Photometric Analysis (PA) [Twicken et al. (2010b)] then selects the pixels defined by the respective target aperture and extracts the light curves, or flux time series, by adding the flux values of these calibrated pixels. PA also labels anoma- lies in the data and marks the respective cadences as invalid ("gapped"). The flux output from PA is then passed to Presearch Data Conditioning (PDC) [Twicken et al. (2010a)], which removes systematic and other errors from the light curves, as described in this paper. These clean light curves from PDC are then passed as input to the Transiting Planet Search (TPS) [Tenenbaum et al. (2010); Jenkins et al. (2010c); Tenenbaum et al. (2012)] module which employs a wavelet-based adaptive matched filter to detect planet transit signatures ("Threshold Crossing Events", TCEs) in the light curves Jenkins (2002)]. Finally, the Data Validation (DV) module Wu et al. (2010)], performs a series of tests on the TCEs detected in TPS to establish or break confidence in each event, for instance eliminating false-positives resulting from eclipsing binaries, and to extract physical parameters for each system. The results from DV are passed to the Kepler Science Team for further analysis and follow-up observations. For storage and publication to the scientific user community, the processed ^Kepler's pointing stability is 0.009", 3 cr on time scales of 15min and greater. pixels and light curves are exported to the Multimission Archive at STScI (MASTrj In this manuscript we describe the new Presearch Data Conditioning module in Kepler SOC 8.0, which has been completely rewritten and replaces the previous PDC module. We will briefly describe the different kinds of systematic errors typically encountered in the light curves, compare the main systematic error cotrendin^ algorithm of the new module with that of the old module, explain the software architecture of the PDC module, and show real light curve examples of the vastly improved performance of the new PDC module. Since one of the major improvements of this completely new version of PDC is the Bayesian Maximum A Posteriori (MAP) approach to cotrending, we will also refer to the new version PDC in Kepler SOC 8.0 as "PDC-MAP", and to PDC pre-8.0 as "PDC-LS" (where LS stands for "least-squares" ) , when comparing the twc 1.3. Errors in the Light Curves: Tasks of PDC Kepler's light curves as they come from PA contain a variety of systematic and other errors, which can occlude the astrophysical features in the light curves, and can prevent detection of small planet transits. These errors are caused by a variety of instrumental effects, and span a wide range of frequencies and amplitudes. The most dominant errors mainly result from Differential Velocity Aberration (DVA), which can introduce low-frequency systematic errors over the whole quarter. Another source of strong systematic errors, typically with periods of hours to days, are focus changes that are caused by thermal transients. Such temperature changes are usually observed during the recovery from planned or unplanned interrupts in the science data collection. Unplanned interrupts include safe- mode events and loss of fine point. Planned events that interrupt science data collection are the monthly "Earth-points" and the quarterly rolls. The former is spatial realignment of the Kepler spacecraft to point its high-gain antenna towards Earth for data downlink Haas et al. (2010)] and typically lasts about one day. The latter is a 90° rotation around Kepler's longitudinal axis every three months to keep the solar arrays directed towards the Sun, and the radiator which cools the CCD away from it Haas et al. ^http : //archive . stsci . edu/index . html ^We refer to the detrending here as "cotrending" to differentiate it from techniques that do not use prior knowledge of the instrumental signatures, be it from correlation in the light curves or from engineering data. ''Note however that PDC-MAP and PDC-LS refer to the whole PDC module, rather than only the MAP or LS fitting algorithms. (2010)]. Temperature-change related focus changes can also be faster, such as the intermit- tent modulation of the focus by ^ 1 fim every 3.2 hours by a heater on one of the reaction wheel housings, which was partially mitigated after quarter 1 Jenkins et al. (2011)]. The magnitude of temperature-related effects on the focus has been found to be ~ 2.2/im per °C I Jenkins et al. (2011)]. Fast modes in the systematic errors can also be caused by ther- mal transients resulting from the reaction-wheel-desaturation on a three day period, and by pointing inaccuracies during the zero-crossings of the reaction- wheels. In addition to these systematic trends and modulations, the PA light curves suffer from local errors such as outliers, discontinuities, Argabrightenings Witteborn et al. (2011)], and gaps in the data. Further, there are also global corrections that have to be applied to the flux amplitude, such as corrections for incompletely captured target flux (the flux fraction. Section 2.3.11), and for excess light from neighboring stars into the aperture (crowding metric). These effects are illustrated in Figure [I] and described in more detail below. The task of the PDC module in Kepler is to identify and correct all these different kinds of systematic errors, while preserving planet-transits and other astrophysical signals in the light curves. 2. Architecture and Algorithms of PDC 2.1. Cotrending of Systematic Errors in PDC The main component of PDC-MAP for removal of systematic trends in the light curves, and arguably the most determinant factor for the significantly improved correction perfor- mance as compared to PDC-LS (before Kepler SOC 8.0), is the new cotrending routine which is based on a Bayesian Maximum A Posteriori approach DeGroot (1970)]. In general. removal of systematic trends in Kepler data is based on the assumption that the systematic error component in a light curve has a significant degree of correlation with a set of other time series that can be exploited to identify and remove the systematic error. For PDC-LS, this set of other time series was given by instrument readings on the Kepler spacecraft, whereas for PDC-MAP correlations with other light curves are used. We will give a brief review of the PDC-LS cotrending algorithm before describing the new PDC-MAP algorithm in Kepler SOC 8.0. As of Kepler SOC 8.0, the old PDC-LS cotrending algorithm is still used to process short-cadence data (due to limitations described below), and a logic branch on the controller level calls the respective PDC version for either long-cadence or short-cadence data. low-frequency trend low-frequency trend (corrected) 1000 2000 3000 4000 5000 time [cadences] 1000 2000 3000 4000 5000 time [cadences] B thermal recovery ast oscillation Aj iji \, llin ■ ArtAWV ^^^^^^ ->-^ ^ / ^^^^ 1000 2000 3000 4000 5000 time [cadences] 2000 3000 4000 5000 time [cadences] D outlier / / : —'^ ■ / / /. / y ^^^/--^ ^^ ^^«/^^Ju~^ .^;^::>-'- , , , discontinuity 3550 3600 3650 3700 3750 3800 3850 time [cadences] 1000 2000 3000 time [cadences] Fig. 1. — Examples for typical errors in PA light curves, which are corrected in PDC. A: long trend (commonly due to DVA). B: thermal recovery, primarily an exponential recovery after an Earth-point, quarterly roll, or safe-mode. The corresponding gap of approx. 50 cadences can be seen right before the recovery. C: fast oscillations, such as from the reaction- wheel desaturation. D: outlier. E: discontinuity (Sudden-Pixel-Sensitivity-Dropoff, SPSD), commonly caused by cosmic rays or solar energetic particles. 2.1.1. Cotrending in PDC-LS: Least squares fitting to ancillary engineering data - Stellar Variability lost Removal of systematic errors in PDC-LS was previously performed based on correlations with a set of ancillary engineering data. These data include the temperatures at the local detector electronics below the CCD array, and polynomials describing the centroid motion of the targets from PA Twicken et al. (2010b)]. This approach is based on the assumption that systematic errors in the light curves can be explained by a combination of instrument effects that are expressed in the engineering data, and the motion of target centroids. A robust least squares fit of each light curve to the design matrix constructed from these time series is performed to identify and remove correlated signatures. For further details of the algorithm, see Ref. Twicken et al. (2010a) PDC-LS did a good job at correcting light curves, and allowed for many of the plethora of new discoveries in planet detection Borucki et al. (2010); Holman et al. (2010); Doyle et al (2011); Fressin et al. (2011); Lissauer et al (2011); Welsh et al. (2012); Borucki et al. (2012)] and stellar astrophysics Meibom et al. (2011); Beck et al (2011); Stello et al (2011); Chaplin et al. (2011)] based on Kepler data. However, it suffered from two deficiencies for a considerable fraction of light curves. First, the biggest problem observed with PDC-LS was overfitting of the data, which led to removal of stellar variability - the corrected light curves of many stars appeared overly flat and featureless. This overfitting was due to the least squares approach to cotrending the light curves. With sufficiently many cotrending basis vectorq^ coincidental correlations of instrument readings with stellar variability happened frequently, and thus the stellar features in the light curves were mistakenly identified as systematic errors and consequently removed. The Bayesian Maximum A Posteriori approach in PDC-MAP is specifically designed to prevent overfitting and in fact solves this problem very well, as will be shown in this work. See Figure [2] for some examples of the input light curve to PDC (the uncorrected fiux output from PA), the (overfitted) output of PDC-LS, and the (significantly improved) output of PDC-MAP. Note that the PDC-LS and PDC-MAP light curves differ slightly in absolute magnitude, because the flux fraction correction (see Section 2.3.11 ) was performed differently in PDC-LS. A second problem of PDC-LS was that high-frequency noise was sometimes injected by the correction, as an unwanted side-effect of reducing the bulk root-mean-square deviation. ^The cotrending basis vectors are a set of time series which describe the systematic errors and are used to remove these systematic errors from the Ught curves by fitting the hght curves against the basis vectors (see below). A self-check was in place to prevent too strong noise-injection, in which case PDC-LS rejected the systematic error correction and instead returned a light curve where systematic trends were not removed, but only other corrections (e.g. discontinuities, outliers, etc) were applied. The fraction of targets which were not cotrended at all was considerable. For instance, of all 162,926 targets observed with Kepler during quarter 7, 25411 (15.6%) light curves were not cotrended. The distribution of non-cotrended targets per channel is displayed in Figure |3] and ranges from 7.7% to 25.4% per channel. Figure |4] shows some examples of light curves that could not be corrected properly with PDC-LS. The problem of noise-injection has been significantly reduced in PDC-MAP, as can be seen in the bottom panels of these examples. 2.1.2. Cotrending in PDC-MAP: Bayesian Maximum A Posteriori fitting to the ensemble of light curves - Stellar Variability regained The new version of PDC, PDC-MAP, uses a different approach in mainly two regards. First, instead of employing ancillary engineering data, cotrending is performed against the ensemble of light curves on the same channel. Secondly, rather than using a simple least squares fit, a Bayesian Maximum A Posteriori (MAP) approach is used for fitting. Note that while this approach does not use any engineering data explicitly, it is still based on the assumption that instrumental signatures are the main cause for systematic errors in the data. Moreover, it assumes that these signatures are highly correlated between targets that are in proximity on the same CCD channel. This second assumption is a critical ingredient for the Bayesian MAP fit, as it allows for the generation of constraints for the fit by using information about systematics in targets within a neighborhood of the target under investigation. It is mainly these constraints that prevent overfitting, and thus help to preserve stellar variability. The MAP algorithm in PDC is described in detail in a companion paper Smith et al. (2012 ) Here we will only provide a brief overview to an extent that is relevant in the context of this paper. Cotrending is performed by fitting each light curve to a set of basis vectors, and con- straining the fit coefficients in a Bayesian Maximum A Posteriori way by using prior infor- mation. The basis vectors are generated from the 50 % most correlated target light curves on the channel, in order to have only the strongest correlated trends in the basis vectors and largely exclude targets with substantial individual fluctuations due to stellar variability. Using Singular Value Decomposition (SVD) on this set of light curves, the N largest singular vectors are used as basis vectors. The number of basis vectors to use can be specified as 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 2.272 ,^2.270- o 3:; 2.268 c 2 2.620 500 ^ 3.075 3.070 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 3.61 ' ' 3.60 1 1 1 ^m^ m0 1 1 1 i£ 4.00 500 500 1000 1500 2000 2500 3000 3500 4000 1000 1500 2000 2500 3000 3500 4000 time [cadences] Tl D O I [— O) Fig. 2. — Examples of light curves where the old PDC-LS overfitted the systematic trends and removed stellar variability. In each of the three panels (1-3), top is the input to PDC (output from PA), middle is PDC-LS (pre Kepler SOC 8.0) using least squares cotrending, bottom is PDC-MAP using maximum a posteriori cotrending. Example 3 also shows incomplete cotrending of smaller scale feature in PDC-LS, as can be seen on the residual 3-day reaction wheel cycle signature (see Fig. [ip). 10 fraction of targets with uncorrected systematics in PDC-LS 25 0.10 0.15 0.20 fraction of targets Fig. 3. — Fraction of targets per channel for which PDC-LS with least squares fitting to ancillary engineering data failed (quarter 7 data). an input parameter to PDC, with eight basis vectors being the defaulljj By construction, these N basis vectors represent the majority of the systematic errors contained in the light curves. Each of the light curves is projected onto these N basis vectors using a least squares fit. To make this fit more robust and prevent overfitting, however, the fit is constrained using prior information from targets which are expected to be affected by similar systematic errors. For that purpose, a target neighbor list is generated for each target with respect to their distance in a 3D space spanned by stellar magnitude (Kp), Right Ascension (RA), and Declination (Dec). The latter two coordinates, RA and Dec, directly translate into the spatial proximity of two targets on the CCD. Further, Kp is used as an additional dimension because systematic errors are not only correlated with location on the CCD but also with signal amplitude. The contribution of each basis vector j to each light curve i is described in the fit coefficient 9*. However, instead of simply using the fit coefficient of the least squares fit, the most likely coefficient given the observations in the neighborhood of the target is picked. In Bayesian terminology, the posterior probability density function (PDF) p{Q\y) is maximized, given a prior PDF p{Q) generated from fit coefficients of all neighboring targets, weighted by their distance to target i: p{Q\y) p{y\Q)-p{Q) p{y\Q)-pm p{y) J p{y\e)pie)de T. ^One potential future improvement for PDC is to calculate the optimum number based on properties of the singular values and vectors, and potentially iterating using the Goodness Metric (see below). 11 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 CO 2 2.06 r 500 \ I \ \ \ I \ 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 500 1000 1500 2000 2500 3000 3500 4000 time [cadences] Fig. 4. — Examples of light curves where PDC-LS did not perform cotrending because too much noise was injected in the correction. Instead, the uncorrected flux from PA was returned, with only non-trend corrections performed, such as discontinuities, outliers, and flux magnitude. 12 The B that maximizes p{Q\y) is used as the fit coefficient for the MAP fit. Note that the denominator p{y) is simply a normalization over all possible observations y and can in practice be omitted in the maximization process. Taking the maximum a posteriori B effectively constrains the fit and helps to avoid overfitting. In particular, only light curve features that are also present in targets in the neighborhood will be removed, whereas any stellar variability that might have coincidental correlation with some of the basis vectors will be constrained by the prior, and hence be preserved. Figure [5] shows the first three basis vectors for each channel on the CCD for quarter 7. This illustrates how the dominant systematic errors vary across the field of view of the CCD. However, some trends are observed on almost all channels. 2.2. Inputs to PDC The inputs to PDC are the calibrated, uncorrected flux time series (measured in pho- toelectrons), as prepared by PA. One unit of work for PDC is a single module output ("channel") from the CCD with the duration of three months, with a sampling period of 29.4 minutes. Thus, there are typically 1000-3500 targets per channel (see Fig.|6]), and about 4500 data points ("cadences") per target]^ In addition to the flux time series, the target data structure also contains per-cadence flux uncertainty values and gap indicators to de- note invalid cadences, and per-target data such as the flux fraction and crowding metric (see below), as well as bookkeeping information such as the Kepler ID and KIC^*^ data for each target. 2.3. Overview and Data Flow The data flow in PDC- MAP is displayed in Figure [7| Figures |8] and |9] show two light curve examples as they are being processed throgh PDC. The individual operations of this sequential processing are described below: ^This is for "fong cadence" data, which is addressed here. "Short cadence" data is sampled at a rate of 57.8 seconds for the duration of one month, and there are only 512 targets in total on all 84 module outputs. ^'^The Kepler Input-Catalog (KIC) is the primary source of information about targets observed with Kepler, and contains their RA/Dec coordinates, estimates for a star's radius, temperature, surface gravity, and other data. URL: http://archive.stsci.edu/kepler/kepler_fov/search.php 13 Module 1 m^ Modules n ^/Vx \^ z^ l^ l~^ '\kj^ w \K %i '/y\\. f%. S^ ^ Module 21 yf \ ^ 1% ^ ViA ' v.3(" Module 25 Fig. 5. — First three cotrending basis vectors for each channel on the Kepler CCD, data from quarter 7. Note that the four channels of Module 3 are missing because that module failed during quarter 4. 14 number of targets per channel ^ H 1 m r - -i LI 1 3500 3000 2500 2000 1500 1000 500 Fig. 6. — Number of targets per channel in quarter 7. Each of the 25 modules is divided into four module outputs (channels). Modules 1, 5, 21, and 25 are used for pointing and calibration purposes and not for collection of science data. Module 3 failed in quarter 4, leaving a total of 80 channels. There are significantly more targets in the upper left corner because that section of the Kepler field of view is closer to the galactic plane. 2.3.1. I. Gap Data Anomalies The first step is to flag anomalies in the flux series and mark the respective cadences as not usable ("gapped"). Anomalies may affect only one target (e.g. Undershoot Columns Brysan et al. (2010)]) or all targets on the channel (e.g. monthly Earth Point downlinks, attitude adjustments, spacecraft Safe-Modes, loss of fine-pointing, and Argabrightenings Witteborn et al. (2011)]), and can have a duration of one to several hundred cadences. The informa- tion about the gaps and the type of anomaly is provided separately by PA. Gapped data is labeled as such and is not exported to the MAST. 15 PDC-MAP Architecture & Dataflow Vie I. Gap Data Anomalies II. Linearly Fill Gaps -y{cill)- w^lall} I III. Coarse MAP"~| -n residuals , ^^^^^^^^^^T , U ri n X ; I IV. SPSDs T— n X ; V. Harmonics Discontinuities, Harmonics, Outliers 3r VI. Outliers 11} J indices of clean targets, and O] Select only clean targets (no SPSDs) with outliers removed y{clean} r D, H, O, *- y-D-H-Oi IVIain IVIAP Final Ukp{e) -y- u,p(e) I B =p(y\0)p(«) -OT) +Wp,lnp{9)l y-U0-H-D-Oi time series D = Discontinuities /iV, H = Harmonics | ^ /!; sinuiit + Bi cosuiit Outliers y-D-H -Oi Oi y Ue = MAP fit 1 I optional operation (input parameter) n X iterated n-times (input parameter) [I Finalization || ^ VIII. Re-till gaps ► IX. Restore Harmonics 1 Propagate Outliers ttirougti PDC ■* X. Flux Fraction and Crowding Metric Correction \ XI. Goodness Metric ► Output Preparation ^ . ^^ ^ / Archive 1 TPS 1 ^ Fig. 7.— Architecture and Data Flow of PDC-MAP as of Kepler SOC 8.0. 2.3.2. II. Linearly Fill Gaps Even though gapped data is not being exported, some operations in PDC require con- tiguous data over the whole time series. Therefore, gaps are filled by linear interpolation for internal use. Note that these linearly interpolated gap values are filled a second time at the end of PDC, using a more sophisticated algorithm (see below) to prepare the light curves for TPS. 2.3.3. III. Coarse Cotrending A first coarse cotrending is performed, using the Bayesian Maximum A Posteriori (MAP) approach described above (2.1.2). This coarse cotrending is only applied temporarily to the fiux series to enhance the performance of the next three steps (IV, V, VI), and the coarse correction is removed afterwards in order to perform the proper MAP correction (step VII). 16 2.3.4- IV. Sudden Pixel Sensitivity Dropoff (SPSD) Correction Approximately 3 % of the light curves contain one or more noticeable downward step discontinuities per quarter. The vast majority of these discontinuities are the result of cosmic rays or solar energetic particles striking the photometer and causing permanent local changes in pixel sensitivity, although partial exponential recovery is often observed Jenkins et al. (2010b)]. Depending on the incident angle and the energy of the cosmic ray, up to a few pixels can be affected, and the net decrease in quantum efficiency is typically ~0.5 % for a light curve. Since these SPSDs are not correlated systematics, they can not be removed by the MAP algorithm (see step VII), and instead a specific submodule exists to detect and correct SPSDs. The Discontinuity Correction algorithm used in PDC-LS had a considerable false-negative rate, and was moreover performing a rather crude correction which did not take partial recoveries into account. The new SPSD module in PDC-MAP is significantly more sophisticated, and performs markedly better at detecting and correcting SPSDs. The details of this new algorithm are described in a separate publication Kolodziejczak et al. (2012) In brief, SPSD detection is performed by fitting a window of the data to a combination of basis functions, with one of the basis functions modeling a step discontinuity in the fiux. The fit coefficient of this basis function is evaluated in the context of fiux variability of the light curve ensemble on that channel, to determine the probability for an underlying SPSD. Further checks are performed to prevent false positive detection of transit-like features as discontinuities. Correction then proceeds as a two-stage process, with the first stage estimating the step size (based on analysis of the entire fiux series), and the second stage modeling the recovery process. The sum of these two components is the total correction applied to the light curve. In the Quarter 9 data, SPSDs were detected in 5252 out of 167404 targets (3.1%). Figure 10 shows a histogram of the fraction of targets containing SPSDs. 2.3.5. V. Harmonics Removal (optional) Harmonic content can be identified and removed from the light curves of harmonically variable targets. This step was a mandatory operation in PDC-LS in order to prevent harmonic content from corrupting the least squares fitting in the cotrending routine. In PDC-MAP the maximum a posteriori fitting approach is robust against uncorrelated target- specific harmonics and this step is therefore not performed per default, but can be requested with an input parameter. If harmonics are removed for further processing in PDC they are restored later after cotrending. Both corrected harmonics-free as well as corrected light curves with harmonics are exported. 17 From a theoretical point of view, one might expect that this (previously required) op- eration of Harmonics Removal prior to cotrending would improve the cotrending - simply because it reduces the feature complexity of the light curves. However, we found that it has virtually no effect with the new MAP cotrending method. In a direct comparison of 1075 light curves processed with and without Harmonics Removal, we found no difference at all for 97% of the targets. For the remaining 3%, performing Harmonics Removal be- fore cotrending resulted in weak residual long term trends, which had been identified as very low frequency harmonics by the Harmonics Detector. Thus, by removing them before cotrending, and restoring them afterwards, these long term trends were preserved in the light curves. Without knowing the ground-truth, one obviously can not know with certainty whether these long term trends are actually astrophysical effects (in which case they should be preserved), or systematics errors (in which case they should be removed). However, for most of these cases the low frequency trends seemed to be systematic errors, as judged by comparison with the systematic trends identified for other targets, and therefore performing Harmonics Removal appeared to actually decrease the quality of the error correction. In summary, while Harmonics Removal was a required step in the least squares based cotrending in PDC-LS, it seems that this operation does not improve the error correction (and may even do harm in a few cases) in the new Bayesian approach of PDC-MAP. Therefore, this step is skipped by default. 2.3.6. VI. Outlier Correction Outliers in each flux series are identified based on the mean and standard deviation in a sliding window. The window size and outlier detection threshold are input parameters to PDC Only isolated (single cadence) points above the threshold are considered to be outliers in order to prevent flagging of planet transits or stellar flares as outliers. These outlier values are "corrected" by replacing them with linearly interpolated values. The purpose of this correction is to prevent false triggering of Threshold-Crossing-Events (TCEs) in TPS and also to improve the performance of other operations downstream in PDC. Before export to the MAST, the original outlier values and uncertainties are restored to the corrected light curves and propagated through the PDC correction steps like all regular data points. Input from PA Gaps filled (linearly) 1000 2000 Outliers removed Coarse MAP trend restored liiiiiLkki 1000 2000 3000 4000 time [cadences] 000 2000 3000 time [cadences] Fig. 8. — Example for light curve processing of a quiet star in PDC, from the output of PA and input to PDC (top right), to the output of PDC (bottom right). The numbers above the arrows denote the operations as defined in Section 2.3 to get from the previous state (which is shown again in gray for comparison) to the current state (blue curve). A: Input to PDC. B: Gaps have been linearly filled. C: The coarse MAP trend has been removed, this is a temporary correction to facilitate discontinuity correction and outlier removal. D: Discontinuities have been corrected. E: Outliers have been removed. F: The coarse MAP trend has been restored. G: The final MAP fit has been removed. H: Gaps have been filled using a more sophisticated algorithm, and the absolute flux magnitude has been adjusted by flux fraction and crowding metric corrections. 19 A V Input from PA III Gaps filled (linearly) ^ 10^ Coarse MAP trend removed 2,82 ''-^.^ 2.815 ^\ 2.605 > 2.8 ^-^^"^^^^-^^ 2.79 2.785 "w%_ Discontinuities corrected Outliers removed x 10" Coarse MAP trend restored ZOOO 300C time [cadences] 1000 2000 3000 time [cadences] Fig. 9. — Example for light curve processing of a variable star in PDC, from the output of PA and input to PDC (top right), to the output of PDC (bottom right). The numbers above the arrows denote the operations as defined in Section 2.3 to get from the previous state (which is shown again in gray for comparison) to the current state (blue curve). A: Input to PDC. B: Gaps have been linearly filled. C: The coarse MAP trend has been removed, this is a temporary correction to facilitate discontinuity correction and outlier removal. D: Discontinuities have been corrected. E: Outliers have been removed. F: The coarse MAP trend has been restored. G: The final MAP fit has been removed. H: Gaps have been filled using a more sophisticated algorithm, and the absolute flux magnitude has been adjusted by flux fraction and crowding metric corrections. 15 2 10 0) E 5 20 average number of SPSDs LI 0.01 0.02 0.03 0.04 0.05 average number of SPSDs per target 0.06 Fig. 10. — Fraction of targets per channel for which an SPSD has been detected in Quarter 9. The total number of channels is 80, and not 84, because module number 3 failed in Quarter 4, reducing the number of data channels by four. 2. 3. 7. Iteration of SPSD Correction, Harmonics Removal, Outlier Correction The operations of SPSD Correction (IV), Harmonics Removal (V), and Outlier Cor- rection (VI) are non-orthogonal to each other. In theory, performing them simultaneously rather than sequentially should yield optimal correction, however at the expense of a sub- stantially more complex algorithm that has not yet developed. Iteration of these operations is an approximation to such a joint fitting approach, and an input parameter was included to iterate the operations IV-VI for improved correction performance. In particular, at the time of the design of the architecture it was expected that Harmonics Removal would improve the performance of the SPSD correction Kolodziejczak et al. (2012)] (step IV), and hence that iterating operations IV-VI would be beneficial for the overall correction. However, it was found that the presence of harmonics in the frequency- and amplitude-range that is observed in the Kepler data does not decrease the detection performance of the SPSD mod- ule by more than 1 %, while removing them would come with the disadvantages mentioned above for step V. For these reasons, the operations IV-VI are per default only performed once in PDC-MAP, while leaving the option to perform an iteration in case future design changes should make this desirable. 21 2. 3. 8. VII. Bayesian MAP approach for cotrending to remove systematic errors This operation constitutes the main step for the correction of systematic trends in the hght curves. Each flux time series is cotrended against the ensemble of hght curves in order to remove correlated trends using a Bayesian Maximum A Posteriori approach as described above (2.1.2). Before cotrending, the correction applied by the Coarse Cotrending (step III) is restored. The main cotrending procedure (this step) is the same as the routine performed in the Coarse Cotrending (step III), but with two important differences that make its correction significantly more accurate. First, only "clean" targets, for which no discontinuities had been found (step IV), are considered to generate the basis vectors. Secondly, discontinuities, harmonics (if enabled), and outliers have been removed from the light curves before the fit, which improves the quality of the correction. 2.3.9. VIII. Fill Gaps for TPS Transiting Planet Search (TPS) requires contiguous data points for all cadences in a light curve, and moreover assumes statistical continuity of the wavelet coefficient variances of the flux time series. The simple linear gap filling operation does not meet this requirement. Therefore, a more sophisticated gap filling algorithm is employed to refill the gaps of the corrected light curves before export to TPS. The gap filler internally distinguishes between gaps with "short" and with "long" duration. The boundary for short and long gaps is determined by the gap filling module parameter set, and typically about 125 cadences. Gaps shorter than this threshold are filled with an autoregressive algorithm that estimates sample values in the gaps with a linear prediction based on the correlation in the neighborhood of the gap. For gaps longer than this threshold, samples away from the edges would tend to become white noise samples, and a different algorithm is therefore used to estimate the gap values: Segments on either side of the gap are refiected onto the gap, and a weighted sum of the two segments is taken where the weighting factor for each segment is decreasing linearly with the distance to respective edge of the gap. This procedure preserves statistical continuity of the data, and in particular, the correlation structure. However, some smoothing at small scales occurs due to the averaging of two noisy signals. To compensate for this effect, the variances of the wavelet coefficients of the filled gap are adjusted to match those of the original data in the neighborhood of the gap. Preserving the wavelet coefficient statistics is of particular relevance because TPS uses a wavelet-based approach for transit detection Jenkins et al. (2010c); Jenkins (2002); Tenenbaum et al. (2010). 22 2.3.10. IX. Restore Harmonics (optional) If harmonic content had been removed from the hght curves (optional step V), then it will be added back here, properly cotrended with MAP. Harmonic-free corrected light curves are exported in addition to the original corrected light curves. Per default, Harmonics Removal is not performed in PDC-MAP as of Kepler SOC 8.0 (see step V). 2.3.11. X. Crowding Metric and Flux Fraction Correction When generating flux series from pixel data in PA, the optimal aperture Twicken et al. (2010b)] for each target may include additional light from nearby light sources. If not removed, this excess flux would decrease the apparent planet transit depth and lead to a systematic underestimation of planet radii. The so-called crowding metric for each target is computed in TAD, and reflects which fraction of the flux in the optimal aperture is due to the target itself. It is a scalar value representing the average over the effective date range. Note that using a scalar average for the crowding metric is an approximation since background sources of light may enter or exit the optimal aperture over the course of the three-month data period. Similarly, using a scalar value for the flux fraction is an approximation because the centroid of the point-spread-function of a target can move over the course of a quartei 11 Similar to excess flux leaking into the optimal aperture, a fraction of the point-spread- function (PSF) of the target may not be captured in its optimal aperture. To account for this missing fraction, the flux fraction, computed by PA, is used to normalize the flux. Together, these two corrections can be expressed as: ^ yjt) - (1 - c) ■ median(y(t)) where y{t) is the corrected cotrended flux, y{t) is the cotrended flux, c is the crowding metric, and / is the flux fraction. The median is used here, rather than the mean, to make the normalization robust against mathematical outliers. This is important in Kepler light curves because transits from planets or eclipsing binaries represent mathematical outliers. '^The maximal motion of a target is «0. 6 pixel over 3 months. 23 2.3.12. XI. Goodness Metric Characterizing the performance of systematic trend removal in PDC can be highly subjective which emphasizes the difficulty in the cotrending task. A human can peruse a selection of light curves and identify poorly cotrended examples to rate performance. Obvious trends and features can be evaluated but subtle issues with the performance will likely be overlooked. So, in an effort to minimize the subjective analysis, a numerical Goodness Metric has been developed. There are three critical aspects of the cotrending performance to consider in the metric; 1) Removal of systematic trends, 2) no introduced noise and 3) preservation of stellar variability. A successful cotrend occurs when all three conditions are met, so one cannot rely solely on target-to-target correlation, for example, to determine acceptable performance. Overfitting, and hence flattening of stellar features, results in zero correlation but would not be considered a "good" correction. The Goodness Metric calculates the cotrending performance in each of the three components using the following methods which were developed experimentally to agree with observations of critical features. Removal of systematic trends is evaluated by the target-to-target correlation. If the cor- relation is near zero then in principle no systematics can remair^ The Pearson correlation, C, is computed over the entire module output and the correlation goodness, Gc, is computed as the arithmetic mean of the cube of the absolute value of the correlation between the target under study and all other targets, G'c. = «c^$^(|a,f) , (3) where i refers to the target under study and j is summed over all other targets. N is the total number of targets. The purpose of the cube is to over-emphasize any strong correlations, ac is a scaling factor. Introduced noise is determined by examining the change in the Power Spectrum Density of the noise floor for each light curve before and after the correction, ATDCT-i^-f ^ PSD(Nfafter) /,x ^'^'°'^'^«' = PSD(Nf«„,)- '^' Ga,., = oa, / log(APSD(/re5)) dfreq, (5) JAPSD>1 where Nf is the noise floor of the light curve, here defined as the first differences between adjacent fiux values. The noise fioor is used because removal of trends can result in large ^^The root sum-square mean of off-diagonal correlations of white Gaussian noise is exactly l/iViVsampie so even with no systematics the off-diagonal correlation matrix will not be exactly zero. 24 changes in the power spectrum density, but signals close to the Nyquist frequency are dom- inated by noise. The integral is also only taken about regions where the change in power is greater than one, implying an increase in noise. The final component, preservation of stellar variability, is the most difficult of the three to numerate. An exact determination would require knowledge of the intrinsic signal. Since it is not possible to make this determination, an estimate is made by assessing a mid-frequency band region where light curve signals are dominated by stellar features. High frequency components are removed via a Savitzky-Golay Filter and the low frequency via a 3rd order polynomial removal. What is left is almost always overwhelmingly dominated by stellar signals ^^ The exception being the Earth-point recovery regions which can be quite strong, so these regions are masked. The standard deviation of the difference between the resultant mid-frequency hght curves before and after the correction, ^before and Rafter respectively, is then computed, Gv,i = 0;v Std (yj,after " ViMioref \/Vi , (6) where V is the variability as calculated by PDC-MAP and is used to emphasize targets with greater variability. The above three components increase in value (in the range of [0,oo) ) as the performance becomes poorer. But we wish for the "goodness" to vary between and 1, 1 being perfect goodness, and so each component is inverted, Gk,i = -Fi r-f • (7) The total goodness is then the geometric mean of the three components. ^* ~ \l^'c,fi'N,fi'v,i ■ (8) The three weighting parameters are used to adjust the relative emphasis of the three com- ponents. They have been empirically tuned to agree with observations of cotrending perfor- mance, with the weighting parameters being ac = 12.0, un = l.^E — 4, and ay = 1.0£'4. The above computed Goodness Metric has been experimentally developed. It is used when reviewing data products and as an aid to data users to give an estimate of how well to trust the cotrending performance. For the future it is planned to incorporate the Goodness Metric as a Lagrange multiplier in the MAP Posterior PDF and to iteratively improve the MAP fit. 13 As the residual is typically only very poorly correlated. 25 2.4. Outputs of PDC The output of PDC is a data structure containing the corrected hght curves. If Harmon- ics Removal was performed (step V), a second set of corrected hght curves without harmonic content is exported as weU. In addition to these flux time series PDC outputs data processing details such as the target and cadence indices of all SPSDs that were identifled, the location of outliers, and diagnostic information such as the Bayesian prior weighting for each target. Of these data, TPS uses the corrected flux time series with fllled gaps. The corrected flux times series, cotrending basis vectors, and additional processing data such as the Goodness Metric for each target are exported to the MAST. 2.5. Processing Times The unit of work for PDC (long-cadence) is one channel for one quarter, so typically 1000-3500 targets with ?«4500 data points each. Processing of Kepler data is performed on the Kepler SOC computer clusters at NASA Ames Research Center and on the Pleiades supercomputer. The relative processing times of the different PDC components is displayed in Figure [TT] ^ and shows that the MAP-flt and the flnal gap flUing are the computationally most expensive operations. Within MAP (Fig. [lib), the maximization of the posterior PDF consumes the most time. Processing of a typical module output with 1000-3500 targets with PDC-MAP takes between 20 and 30 hours on a modern 3 GHz CPU with four cores. 3. Summary and Conclusions We have presented the new (SOC 8.0) Presearch Data Conditioning (PDC) module of the Kepler Science Processing pipeline. The task of the PDC module is to correct the differ- ent systematic errors in the Kepler light curves that otherwise occlude planet transits and other astrophysical signatures. Errors that are corrected by PDC include systematic trends in the data (usually resulting from difference velocity aberration, temperature drift, and point- ing errors due to vibrations), flux discontinuities (caused by cosmic rays or solar energetic particles), outliers, and scalar corrections to account for the flnite point-spread- function of the targets. Using a Bayesian Maximum A Posteriori approach to remove correlated trends from the light curves, the new PDC (PDC-MAP) has a signiflcantly improved performance as compared with the previous version (PDC-LS) that used a least squares fltting approach with ancillary engineering data Twicken et al. (2010a)]. In particular, PDC-MAP is able to remove systematic errors but at the same time preserve the stellar variability. 26 Coarse MAP Linear Gap filling Gap filling Outliers MAP Maximizer Posterior Conditional ,„, Prior sl!?5 Robust fit SVD -^^^ /3% Compilation Fig. 11. — Fractions of total processing time for PDC (panel A), and for the MAP part of PDC (panel B). Despite the dramatic improvements in data correction of this new PDC module, sev- eral further improvements are conceivable and will be implemented in the future. Foremost, the generation of basis vectors for the MAP cotrending could potentially be improved by performing band-splitting on the light curves and thus operating MAP on different scales independently. Further, the MAP cotrending algorithm has several free parameters (e.g. the number of basis vectors) that are given as input parameters. Ideally, these parameters would be chosen automatically, for instance by optimizing the Goodness Metric. Another po- tential improvement would be to have time-dependent corrections for the flux-fraction and the crowding-metric, which in the current version are approximated as scalar corrections for each quarter. Finally, due to limitations described above, PDC-MAP is currently only used for Kepler long-cadence light curves, whereas the old PDC-LS is still being used for short-cadence light curves - although short-cadence data, with its high relevance for aster- oseismological studies amongst others, could greatly benefit from the superior performance of PDC-MAP as well. Even though some limitations inherent to short-cadence data (such as the sparsity of targets per channel, prohibiting the generation of proper priors in MAP) render it impossible to directly apply PDC-MAP, as is, to short-cadence data, several options 27 exist to partly circumvent these limitations and extend the applicability of PDC-MAP to short-cadence data, such as using interpolated basis vectors from long cadence processing to perform removal of short-cadence systematic trends. Funding for this Discovery Mission is provided by NASA's Science Mission Directorate. We thank the thousands of people whose efforts made Kepler's grand voyage of discovery possible. We especially thank the Kepler SOC staff who helped design, build, and operate the Kepler Science Porcessing Pipeline for putting their hearts into this endeavor. Facilities: Kepler REFERENCES Beck, P. G., Bedding, T. R., Mosser, B., et al. 2011, Science, 332, 205 Borucki, W. J., Koch, D., Basri, G., et al. 2010, Science, 327, 977 Borucki, W. J., Koch, D. G., Batalha, N., et al. 2012, The Astrophysical Journal, 745, 120 Bryson, S. T., Jenkins, J. M., Klaus, T. C., et al. 2010, Proc. SPIE, 7740, 77401D Chaplin, W. J., Kjeldsen, H., Christensen-Dalsgaard, J., et al. 2011, Science, 332, 213 Charbonneau, D., Brown, T. M., Latham, D. W., & Mayor, M. 2000, The Astrophysical Journal, 529, L45 Deeg, H. J., Doyle, L. R., Kozhevnikov, V. P., et al. 1998, Astronomy & Astrophysics, 339, 479 DeGroot, M. 1970, Optimal Statistical Decisions (Wilcy-Interscicncc) Doyle, L. R., Carter, J. A., Fabrycky, D. C, et al. 2011, Science, 333, 1602 Fressin, F., Torres, G., Rowe, J., et al. 2011, Nature Haas, M. R., Batalha, N. M., Bryson, S. T., et al. 2010, Astrophysical Journal, 713, L115 Holman, M. J., Fabrycky, D. C, Ragozzine, D., et al. 2010, Science, 330, 51 Jenkins, J. M. 2002, Astrophysical Journal, 575, 493 Jenkins, J. M., Smith, J. C, Tenenbaum, P., & Cleve, J. D. T. J. V. 2011, Advances in Machine Learning and Data Mining for Astronomy, ed. M. Way, J. Scargle, K. Ali, & A. Srivastava (Chapman and Hall, CRC Press) Jenkins, J. M., Caldwell, D. A., Chandrasekaran, H., et al. 2010a, Astrophysical Journal, 713, L120 Jenkins, J. M., Caldwell, D. A., Chandrasekaran, H., et al. 2010b, Astrophysical Journal, 713, L87 Jenkins, J. M., Chandrasekarana, H., McCauliff, S. D., et al. 2010c, in SPIE Conference Series, Vol. 7740 Kaltenegger, L., & Sass, D. 2011, The Astrophysical Journal Letters, 736, L25 Kasting, J. F., Whitmire, D. P., & Reynolds, R. T. 1993, Icarus, 101, 108 Koch, D. G., Borucki, W. J., Basri, G., et al. 2010, Astrophysical Journal Letters, 713, L79 28 Kolodzicjczak, J., ct al. 2012, (to be submitted to PASP) Lissaucr, J., Fabrycky, D., Ford, E., ct al. 2011, Nature, 470, 53 Meibom, S., Barnes, S. A., Latham, D. W., et al. 2011, The Astrophysical Journal Letters, 733, L9 Middour, C, Klaus, T. C, Jenkins, J., et al. 2010, in Proc. SPIE, Vol. 7740, 77401A Quintana, E. V., Jenkins, J. M., Clarke, B. D., et al. 2010, in SPIE Conference Series, Vol. 7740 Smith, J. C, Jenkins, J. M., Morris, R. L., et al. 2012, (submitted to PASP) Stello, D., Meibom, S., Gilliland, R. L., et al. 2011, The Astrophysical Journal, 739, 13 Tenenbaum, P., Bryson, S. T., Chandrasekaran, H., et al. 2010, in SPIE Conference Series, Vol. 7740 Tenenbaum, P., Christiansen, J. L., Jenkins, J. M., et al. 2012, Astrophysical Journal, (accepted) Twickcn, J. D., Chandrasekaran, H., Jenkins, J. M., et al. 2010a, in SPIE Conference Series, Vol. 7740 Twicken, J. D., Clarke, B. D., Bryson, S. T., et al. 2010b, in SPIE Conference Series, Vol. 7740 Welsh, W., Orosz, J., Carter, J., ct al. 2012, Nature Witteborn, F. C, Cleve, J. V., Borucki, W., Argabright, V., & Hascall, P. 2011, Proc. SPIE, 8151, 815117 Wu, H., Twicken, J. D., Tenenbaum, P., et al. 2010, SPIE Conference Series, 7740 This preprint was prepared with the AAS I^Tj^X macros v5.2.