Observational studies

A good start is Rosenbaum’s Observation and Experiment: An Introduction to Causal Inference, which is wordy but gives a high-level conceptual introduction to observational study design. Rosenbaum has two other monographs, Observational Studies and Design of Observational Studies, with the mathematical details.

Empirical studies of accuracy

Benson, K., & Hartz, A. J. (2000). A comparison of observational studies and randomized, controlled trials. New England Journal of Medicine, 342(25), 1878–1886. doi:10.1056/nejm200006223422506

Systematically compares the results of medical observational studies to randomized controlled trials of the same intervention, and finds the effect sizes actually match reasonably well (the combined observational estimate lies within the 95% CI of the combined RCT estimate for all but 2 of the 19 treatments examined).
Young, S. S., & Karr, A. (2011). Deming, data and observational studies. Significance, 8(3), 116–120. doi:10.1111/j.1740-9713.2011.00506.x

Argues, from surveys of observational studies and RCTs, that observational studies actually have an abysmal success rate: across 12 RCTs testing 52 observational claims, zero of the claims replicated. Recommends listening to Deming’s work on quality control: inspection is not the solution; structural and management problems must be fixed. Identifies “Multiple testing, bias, and multiple modelling” as core problems.

Techniques

Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26(1), 20–36. doi:10.1002/sim.2739

Donald Rubin’s perspective on designing observational studies, using techniques similar to those one might use to design a randomized controlled trial. He argues that by keeping the principles of experimental design in mind, and by designing the study before we have access to any outcome data, we can produce reasonably reliable observational studies.
Lipsitch, M., Tchetgen Tchetgen, E., & Cohen, T. (2010). Negative controls. Epidemiology, 21(3), 383–388. doi:10.1097/ede.0b013e3181d61eeb

Suggests applying the technique of negative controls from experimental science to observational studies. For example, a study of whether influenza vaccines reduce pneumonia or influenza hospitalization in the elderly produced an improbably large effect; when the same protective effect was found for hospitalization due to injury or trauma, it became clear there must be an unmeasured confounder.

A good negative control “cannot involve the hypothesized causal mechanism but is very likely to involve the same sources of bias that may have been present in the original association.”

There are negative control outcomes, like hospitalization for injury or trauma above, and negative control exposures, which in the above example would involve testing whether a completely unrelated vaccine also appears to protect against influenza.
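
The influenza example above can be illustrated with a small simulation. This is a minimal sketch, not the analysis from the paper: the confounder (“frailty”), every probability, and the function names simulate and risk_ratio are assumptions invented for illustration. Frail people are both less likely to be vaccinated and more likely to be hospitalized for any reason, so vaccination appears to protect against injury even though the simulated vaccine has no effect on it.

```python
import random


def simulate(n=200_000, seed=0):
    """Simulate vaccination and two hospitalization outcomes.

    All probabilities are made-up illustrative values, not estimates
    from any real study.
    """
    rng = random.Random(seed)
    counts = {
        True: {"n": 0, "flu": 0, "injury": 0},   # vaccinated
        False: {"n": 0, "flu": 0, "injury": 0},  # unvaccinated
    }
    for _ in range(n):
        # Unmeasured confounder: frail people are vaccinated less often
        # and hospitalized more often, whatever the cause.
        frail = rng.random() < 0.3
        vaccinated = rng.random() < (0.4 if frail else 0.8)

        # True causal effect of the vaccine on influenza: risk ratio 0.8.
        flu_risk = (0.10 if frail else 0.02) * (0.8 if vaccinated else 1.0)
        # Negative control outcome: injury risk ignores vaccination entirely.
        injury_risk = 0.08 if frail else 0.02

        group = counts[vaccinated]
        group["n"] += 1
        group["flu"] += rng.random() < flu_risk
        group["injury"] += rng.random() < injury_risk
    return counts


def risk_ratio(counts, outcome):
    """Crude (unadjusted) risk ratio, vaccinated versus unvaccinated."""
    risk_vaccinated = counts[True][outcome] / counts[True]["n"]
    risk_unvaccinated = counts[False][outcome] / counts[False]["n"]
    return risk_vaccinated / risk_unvaccinated


if __name__ == "__main__":
    counts = simulate()
    print("influenza RR:", round(risk_ratio(counts, "flu"), 2))
    print("injury RR (negative control):", round(risk_ratio(counts, "injury"), 2))
```

In this setup the true risk ratio is 0.8 for influenza hospitalization and 1.0 for injury, yet both crude risk ratios come out well below those values. A clearly protective estimate for the negative control outcome is the signal that the influenza estimate is confounded as well.
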
Schuemie, M. J., Ryan, P. B., DuMouchel, W., Suchard, M. A., & Madigan, D. (2013). Interpreting observational studies: Why empirical calibration is needed to correct p-values. Statistics in Medicine, 33(2), 209–218. doi:10.1002/sim.5925

Applies negative controls to observational studies of the effects of drugs, finding that biases in the observational studies lead to false positive rates higher than 50% for each study design they tried. They suggest “calibration” of p values, essentially recomputing them against an empirical null distribution estimated from the negative controls, so the right alpha can be maintained. I’m not sure how this addresses the causal issues inherent here, or whether a “calibrated” p value would convince me of causality any more than an uncalibrated one.
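
A rough way to see what calibration does: treat the negative control estimates as draws from the distribution of estimates this study design produces when the true effect is null, then compare a new estimate against that distribution instead of against zero. The sketch below is a simplified method-of-moments version written for illustration, not the fitting procedure from the paper; the function names and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def fit_empirical_null(log_rrs, ses):
    """Estimate the mean and variance of systematic error from negative controls.

    Simplifying assumption: the observed spread of the negative control
    estimates equals systematic variance plus average sampling variance.
    """
    mu = mean(log_rrs)
    sampling_var = mean([se ** 2 for se in ses])
    tau2 = max(variance(log_rrs) - sampling_var, 0.0)
    return mu, tau2


def two_sided_p(log_rr, se, mu=0.0, tau2=0.0):
    """Two-sided p value; the defaults give the usual uncalibrated p value."""
    z = (log_rr - mu) / sqrt(tau2 + se ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical log risk ratios for eight negative controls (true effect: none).
negative_control_log_rrs = [0.35, 0.10, 0.55, 0.25, 0.40, -0.05, 0.30, 0.50]
negative_control_ses = [0.10] * 8

mu, tau2 = fit_empirical_null(negative_control_log_rrs, negative_control_ses)

# Hypothetical estimate for the drug-outcome pair of actual interest.
uncalibrated = two_sided_p(0.40, 0.10)
calibrated = two_sided_p(0.40, 0.10, mu, tau2)

if __name__ == "__main__":
    print(f"uncalibrated p = {uncalibrated:.5f}, calibrated p = {calibrated:.3f}")
```

With these made-up numbers the negative controls are biased upward on average, so an estimate that looks highly significant against a null of zero is unremarkable against the empirical null. The calibrated p value only says the estimate is not unusual for this study design; it does not by itself say anything about causality.
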
Madigan, D., Stang, P. E., Berlin, J. A., Schuemie, M., Overhage, J. M., Suchard, M. A., Dumouchel, B., Hartzema, A. G., & Ryan, P. B. (2014). A systematic statistical approach to evaluating evidence from observational studies. Annual Review of Statistics and Its Application, 1(1), 11–39. doi:10.1146/annurev-statistics-022513-115645

Reports on the results of the Observational Medical Outcomes Partnership (OMOP), which tested a range of observational study designs (case-control, cohort, and many variations) on several large medical databases to see which methods were susceptible to bias, which produced good confidence interval coverage, and so on, using positive and negative controls to characterize error rates.
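
Confidence interval coverage with negative controls can be computed directly: if the true effect is known to be null, a nominal 95% interval should contain a risk ratio of 1 about 95% of the time. The following is a minimal sketch assuming normal-approximation intervals on the log scale; the function names and the estimates are hypothetical and not taken from OMOP.

```python
def covers(log_rr, se, true_log_rr=0.0, z=1.96):
    """True if the nominal 95% interval for log_rr contains the true value."""
    lower = log_rr - z * se
    upper = log_rr + z * se
    return lower <= true_log_rr <= upper


def coverage(estimates, true_log_rr=0.0):
    """Fraction of (log_rr, se) estimates whose interval covers the truth."""
    hits = sum(covers(log_rr, se, true_log_rr) for log_rr, se in estimates)
    return hits / len(estimates)


# Hypothetical (log risk ratio, standard error) pairs for negative controls,
# where the true log risk ratio is taken to be 0.
negative_controls = [
    (0.35, 0.10), (0.10, 0.10), (0.55, 0.10), (0.25, 0.10),
    (0.40, 0.10), (-0.05, 0.10), (0.30, 0.10), (0.50, 0.10),
]

if __name__ == "__main__":
    print("coverage:", coverage(negative_controls))
```

Here only two of the eight intervals contain the null, so the nominal 95% intervals have 25% coverage for this made-up design. Positive controls can be checked the same way by passing the known effect size as true_log_rr.
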