Most data sets are generated according to guidelines specified by the tools they are built for. Audio and video have to be stored in a specific way to be supported by media players, sensor data is formatted so that it can be interpreted by the surrounding workflows, and even where no explicit format is used, the data still transports information based on a small set of rules. For text data, this is not the case: the text itself is the information, and the complexity of the information it describes is theoretically unlimited. This results in a huge heterogeneity of text formats, because the tools used to create the files do not limit their content as long as letters and other characters are not encoded in an unknown way.
While there are cases in which PDFs generated from scanned JPG files pass as digital documents, most digitized and edited documents from established text corpus projects can be used as plain text. Anything else, like for instance .odt, .doc or .pdf, is considered unusable, at least by me, which solves this variety problem in my working environment.
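In practice, this policy can be enforced with a simple filter over a corpus directory. The following is a minimal sketch (the function name `usable_texts` and the set of rejected suffixes are my own illustrative choices, not taken from any particular project) that keeps only files which can be read as plain text and skips the binary document formats mentioned above:

```python
from pathlib import Path

# Illustrative list of binary document formats to reject; extend as needed.
REJECTED_SUFFIXES = {".odt", ".doc", ".docx", ".pdf"}

def usable_texts(corpus_dir):
    """Yield paths of files in corpus_dir that are not in a rejected binary format."""
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() not in REJECTED_SUFFIXES:
            yield path
```

A stricter variant could additionally attempt to decode each candidate file as UTF-8 and reject files that fail, which catches binary data hiding behind a harmless extension.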