The use of genome-scale data to
infer phylogenetic relationships has surged in recent years due to progress
made in target-gene capture methods and sequencing techniques. Data filtering,
the practice of excluding undesirable data from analyses, could in principle alleviate problems caused by systematic errors in phylogenetic inference. Data filtering approaches are now more relevant than ever because genome-scale data sets often contain thousands of loci, allowing for a more
rigorous data selection scheme. Different data filtering criteria, including evolutionary rate, base composition, phylogenetic informativeness, stemminess, and molecular clock-likeness, among others, have been proposed for selecting useful phylogenetic markers. Nevertheless, few studies have tested these criteria simultaneously. We tested the separate and joint effects of
those data filtering criteria. By carefully examining different characteristics of molecular markers, we found that calibrated clock-likeness is the best indicator of the phylogenetic usefulness of molecular markers in our study. Stemminess, the ratio of the summed internal branch lengths to the total tree length, is not a good indicator of phylogenetic performance. Slow-evolving genes
are also not necessarily strong a priori candidates for phylogenetic analysis. Phylogenetic informativeness can be useful when trying to resolve specific nodes in a tree, such as those associated with short internal branches, but may not be a good indicator of the tree-wide usefulness of a locus. Base compositional bias, as indicated by relative composition variability (RCV)
values, may be a useful indicator of problematic phylogenetic markers, particularly when a threshold cutoff is used to exclude outliers with very high RCV values.
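
To make two of the criteria above concrete, the following minimal Python sketch computes stemminess as defined in the text, computes RCV, and applies a threshold cutoff to exclude high-RCV loci. It is an illustration under stated assumptions, not the pipeline used in this study. The RCV formula (summed absolute deviations of per-taxon nucleotide counts from the across-taxon mean, divided by the number of taxa times the number of sites) is a commonly used definition that we assume here. The function names, the toy alignments, and the cutoff of 0.5 are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def stemminess(internal_lengths, terminal_lengths):
    """Ratio of summed internal branch lengths to total tree length."""
    internal = sum(internal_lengths)
    total = internal + sum(terminal_lengths)
    if total == 0:
        raise ValueError("tree has zero total length")
    return internal / total


def rcv(alignment):
    """Relative composition variability of one aligned locus.

    `alignment` maps taxon name -> aligned sequence (equal lengths).
    Assumed formula: sum over taxa and over A, C, G, T of
    |count in taxon - mean count across taxa| / (n_taxa * n_sites).
    Gaps and ambiguity codes are not counted as nucleotides.
    """
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        raise ValueError("alignment is empty")
    n_taxa = len(seqs)
    n_sites = len(seqs[0])
    if n_sites == 0 or any(len(s) != n_sites for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    counts = [Counter(s) for s in seqs]
    deviation = 0.0
    for base in NUCLEOTIDES:
        mean = sum(c[base] for c in counts) / n_taxa
        deviation += sum(abs(c[base] - mean) for c in counts)
    return deviation / (n_taxa * n_sites)


def filter_by_rcv(loci, max_rcv):
    """Keep only loci whose RCV does not exceed the cutoff `max_rcv`."""
    return {name: aln for name, aln in loci.items() if rcv(aln) <= max_rcv}


# Toy data (made up for illustration only).
toy_loci = {
    "balanced": {"taxon1": "ACGT", "taxon2": "ACGT"},
    "skewed": {"taxon1": "AAAA", "taxon2": "GGGG"},
}
kept = filter_by_rcv(toy_loci, max_rcv=0.5)  # hypothetical cutoff
```

In practice the cutoff would be chosen from the empirical distribution of RCV values across loci, for example by flagging only the extreme upper tail, rather than fixed in advance.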