An institution with more time, expertise, and resources to expend should pursue a multi-step workflow to identify and address any known problematic formats in its collections.
Institutions with more technical staffing might prefer more advanced command-line approaches. Unix programs such as the find and file commands (or similar tools in other operating systems) can be used together with a shell script to create a per-file list of MIME type values for a top-level directory or its sub-directories. This list can then be output in a tabular format (e.g., TXT, TSV, or CSV) for further analysis and format tracking. The institution can store this output file and/or any derivations (e.g., .xls, .txt, .docx, .pdf) in a sub-folder alongside the corresponding directory of analyzed files. Ideally, the directory name and date should be included in the output filenames. If files are added to the collection over time, the commands can be re-run and a new set of outputs stored. Tools such as JHOVE, JHOVE2, and FITS go a step further: they not only identify file formats but also validate files' conformance to those formats. They can provide report outputs in several tabular formats, such as those mentioned above, as well as in XML.
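The inventory step described above can be sketched in a short script. This minimal example uses Python's standard-library mimetypes module, which guesses types from file extensions only; a content-based identifier such as the Unix file command (or DROID, JHOVE, or FITS) is more reliable in practice. The filename convention shown (directory name plus date) follows the recommendation above.

```python
#!/usr/bin/env python3
"""Sketch: build a per-file MIME-type inventory and write it to a dated CSV."""
import csv
import mimetypes
from datetime import date
from pathlib import Path

def inventory(directory: str) -> list[tuple[str, str]]:
    """Walk *directory* recursively and return (path, mime_type) pairs."""
    rows = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            mime, _ = mimetypes.guess_type(path.name)  # extension-based guess
            rows.append((str(path), mime or "unknown"))
    return rows

def write_inventory(directory: str) -> Path:
    """Write the inventory to a CSV named after the directory and today's date."""
    dir_path = Path(directory)
    out = dir_path / f"{dir_path.name}_formats_{date.today().isoformat()}.csv"
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "mime_type"])
        writer.writerows(inventory(directory))
    return out
```

Re-running the script after new files arrive produces a fresh, dated output file, so successive inventories can be compared over time.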
Once the institution has this basic knowledge about the file formats it manages, it can explore and experiment with some of the nascent format registries to determine any sustainability issues these formats may present (e.g., obsolescence, lack of open standards, or backwards-compatibility problems). An institution can conduct this research using the Unified Digital Format Registry (UDFR), PRONOM, and/or the Library of Congress Sustainability of Digital Formats website. For example, an institution's analysis of its file formats may reveal several born-digital newspaper files in the HTML 2.0 format. With this information, the institution could turn to the UDFR, search on HTML 2.0, and retrieve a full format profile: its successor format versions (in this case HTML 3.2, HTML 4.0, and XHTML), the applications that could output files in this format, and the applications that can still render HTML 2.0 documents.
The institution can then set up a test-bed environment for experimenting with migration and normalization tools such as Xena. Using subset copies of its born-digital newspaper content, the institution could experiment with Xena's built-in features for converting this legacy HTML content into valid XHTML.
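To illustrate the kind of transformation such a conversion involves, the sketch below uses Python's standard-library html.parser as a stand-in, not Xena itself: it re-emits legacy markup with lowercase tag names and explicitly closed void elements, two of the changes XHTML requires. A production migration would rely on a tested tool and validate its output.

```python
#!/usr/bin/env python3
"""Sketch: rewrite legacy HTML toward XHTML conventions (illustrative only)."""
from html.parser import HTMLParser

# Elements that have no closing tag in legacy HTML but must self-close in XHTML.
VOID = {"br", "hr", "img", "meta", "link", "input"}

class XHTMLRewriter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):  # tag arrives already lowercased
        attr_text = "".join(f' {k}="{v or ""}"' for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")  # self-close void elements
        else:
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # preserve entities such as &amp;

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def to_xhtml(source: str) -> str:
    rewriter = XHTMLRewriter()
    rewriter.feed(source)
    rewriter.close()
    return "".join(rewriter.out)
```

Even this toy version shows why testing on subset copies matters: a real migration must also handle character encodings, attribute quoting, and document-type declarations before output can be called valid XHTML.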
Once the format risk factors and migration pathways are both known and thoroughly tested, the institution can make a policy decision regarding normalization and/or migration for files stored in this format. Depending on that policy, the institution may choose to normalize, migrate, or continue storing files in their current formats.
Downloading, installing, and testing the various utilities and tools mentioned above will require work by technical staff, curators, or consultants with command-line experience. Structuring and making sense of the outputs from such tools will also require some investment of time (and patience).
Finally, performing format migrations requires a larger resource investment than other format management steps. To perform migrations, an institution should ideally set aside a workstation with adequate space, processing capability, and configurations for a proper test-bed environment. The institution will need to determine and test the proper migration tools, a task that will necessarily involve both curators and staff with experience in installing and configuring open-source software. The institution should run sample conversions and perform manual quality checks prior to any batch migrations, and all migrations should be deployed in accordance with institutional policy documentation. Coordination between technicians and curators will be needed throughout the migration process.
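The sample-then-batch discipline described above can be sketched as follows. The convert callable here is a placeholder for whatever migration tool the institution has chosen, and the automated checks (output exists and is non-empty) are deliberately simple; per the text, they would be supplemented with manual quality review by curators before any batch run.

```python
#!/usr/bin/env python3
"""Sketch: run sample conversions and gate batch migration on the results."""
import random
from pathlib import Path
from typing import Callable

def trial_run(files: list[Path], convert: Callable[[Path], Path],
              sample_size: int = 5, seed: int = 0) -> list[tuple[Path, bool]]:
    """Convert a random sample of *files* and report a pass/fail flag per file."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(files, min(sample_size, len(files)))
    results = []
    for src in sample:
        try:
            dst = convert(src)
            ok = dst.exists() and dst.stat().st_size > 0  # basic sanity check
        except Exception:
            ok = False  # a crashing converter fails that file outright
        results.append((src, ok))
    return results

def ready_for_batch(results: list[tuple[Path, bool]]) -> bool:
    """Proceed to batch migration only if every sampled conversion passed."""
    return bool(results) and all(ok for _, ok in results)
```

Logging the trial results alongside the institution's policy documentation gives curators and technicians a shared record of what was tested before the full migration is approved.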