Working with files
by Tim Anderson
Navigating the file format maze: Tim Anderson finds some tools that can help.
HardCopy Issue: 60 | Published: May 1, 2013
It is a common scenario: “I’d like the application to save the report as an Excel spreadsheet”, says the client. Another frequent request is for Adobe PDF output, often used for emailing invoices and other documents thanks to its reliable rendering across a broad range of platforms. Web applications may need to generate charts in PNG format, or deliver a dynamically assembled download as a ZIP file.
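As a minimal sketch of the last case, here is how a dynamically assembled ZIP download might be built in memory using Python's standard zipfile module. The function name build_zip and the sample file names are illustrative assumptions, not taken from any particular framework.

```python
import io
import zipfile


def build_zip(files):
    """Return the bytes of a ZIP archive built from a {name: bytes} mapping.

    Hypothetical helper for illustration; 'files' maps archive paths to content.
    """
    buffer = io.BytesIO()
    # ZIP_DEFLATED compresses each entry; the archive never touches the disk.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    payload = build_zip({
        "report.csv": b"month,total\nJan,120\nFeb,95\n",
        "readme.txt": b"Generated on request.\n",
    })
    print(len(payload), "bytes ready to send as application/zip")
```

A Web application would typically return these bytes with a content type of application/zip; how that is done depends on the framework in use.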
You may also need to handle files as input rather than output. Someone, or another application, might create an Excel spreadsheet, for example, from which you have to gather structured data to insert into a database. All these tasks require you to understand and work with file formats.
File formats can be both highly complex and poorly documented. If you have ever sat down to write a file format parser or generator from scratch, you will know how difficult this can be. Typically, file formats exist in multiple versions, and some – the old binary Microsoft Office formats, for example – went without official documentation for many years, leaving developers to puzzle them out by reverse engineering. Some formats that are documented, like Rich Text Format, have quirks that you can only discover the hard way, when your output does not work as expected.
Another issue is that what the user perceives as a single document may in fact contain multiple formats, such as documents which contain images. The AVI (Audio Video Interleave) video format is really a specification for a multimedia container which may contain a variety of media formats. By contrast, there are open formats like PNG (Portable Network Graphics) which are well documented and relatively easy to handle.
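To give a sense of how approachable a well-documented format can be, the following sketch writes and reads a tiny greyscale PNG using only Python's struct and zlib modules. The function names are hypothetical, and the code covers only the simplest case (8-bit greyscale, no interlacing, no filtering); a real application would normally use an imaging library.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, then CRC-32 over type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def write_gray_png(rows):
    """Encode equal-length rows of 0-255 values as an 8-bit greyscale PNG."""
    height = len(rows)
    width = len(rows[0])
    # IHDR fields: width, height, bit depth 8, colour type 0 (greyscale),
    # compression 0, filter method 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw))
            + _chunk(b"IEND", b""))


def read_png_size(data):
    """Return (width, height) from the IHDR chunk of PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, kind = struct.unpack(">I4s", data[8:16])
    if kind != b"IHDR" or length != 13:
        raise ValueError("IHDR chunk missing or malformed")
    return struct.unpack(">II", data[16:24])
```

The whole file is a fixed signature followed by length-prefixed, checksummed chunks, which is why a basic reader or writer fits in a few dozen lines.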
Fortunately it is rarely necessary to work from scratch. This is one of those cases where developers are better off using an existing library or component. Numerous resources are available, ranging from official SDKs and components to open source and commercial third-party libraries. One caveat though: this is one area where the word ‘supported’ needs to be treated with caution. File formats are complex, and we have all had the experience of importing or converting a document from one format to another, and getting scrambled formatting or missing content. In the case of the most complex formats, few if any libraries cover every last feature perfectly. That said, the most important and commonly used features get the best support, so often there is no need to worry.
What if the format is XML? Does that mean everything will work with standard XML libraries? It certainly helps, but XML is not so much a file format as a standardised way of creating file formats; after all, the X stands for eXtensible. Both Microsoft’s Open XML and the OASIS OpenDocument (ODF) formats are XML-based, for example, and both are standardised by ISO, but that does not make them similar or any less complex, and you still need libraries to use them productively.
HTML, the format rendered by Web browsers, is something of a special case. HTML is not an XML language, though the two share a common ancestry in SGML (Standard Generalized Markup Language); XHTML is a version of HTML that is an XML language. If you need to generate Web pages outside the context of a Web application framework, it may be easier to generate XHTML. If you are parsing HTML (a technique called screen scraping, which is ugly and prone to failure if the target site changes), it can pay to convert it to XHTML first, using a library like the open source HTML Tidy.
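The difference between HTML and XML shows up as soon as you try to parse real-world markup. The sketch below, in Python with only the standard library, feeds the same fragment to a strict XML parser and to a tolerant HTML parser. It is a stand-in for illustration and does not use HTML Tidy; the class and function names and the sample markup are assumptions.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def is_well_formed_xml(text):
    """Report whether a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Typical browser-tolerated HTML: unclosed <p> and <br>, unquoted attribute.
SAMPLE = '<p>Prices<br><a href="/list">full list</a> <a href=about.html>about</a>'

if __name__ == "__main__":
    print(is_well_formed_xml(SAMPLE))  # the XML parser rejects this markup
    print(extract_links(SAMPLE))       # the HTML parser still finds the links
```

Once the markup has been converted to well-formed XHTML, ordinary XML tools can be used on it, which is the attraction of converting first.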
If your client presents you with documents in an obsolete format, such as Microsoft Works files with a WKS extension, then it’s more likely to require a one-off conversion than something you need to build into an application. This is one reason for keeping old software available on virtual machines. The open source DOSBox is particularly good for running 16-bit Windows 3.1, for example, and you can install applications like Works, Lotus SmartSuite and other old software (although getting hold of the installation media can be a challenge). A simpler solution is to use a service like Zamzar, which lets you upload files in a wide range of formats, including documents, graphics, audio and video, and convert them to an appropriate modern format. Zamzar has both free and paid-for options for its conversion service.
Working with PDF
Adobe’s PDF format benefits from strong developer support, both from Adobe and from third-party libraries. These are some of the libraries available:
Aspose.Pdf has a wide range of features including the ability to create PDF documents, import or export form data, use XML templates, and convert a wide range of document types to PDF, including Word, HTML, SVG, LaTeX, PCL and XPS. This is a .NET library with examples for C# and Visual Basic. There is also a version for Java.
AspPDF (ActiveX) and AspPDF.NET are libraries from Persits Software for creating and exchanging PDF documents. The API covers drawing and image handling, signing, table support, form creation and filling, HTML conversion and more.
easyPDF SDK 7.0, from BCL, offers programmatic control of PDF from ASP, ASP.NET, C#, C++ and Java (on Windows). Along with support for creating and manipulating PDF documents and forms, you can convert Microsoft Office documents and print any document to PDF format using the PDF Printer API.
iText is an open source library for Java and C# which can generate PDF documents, fill PDF forms programmatically, add digital signatures and more. It is available under both open source and commercial licenses.
PDF Creator Pilot is packaged as a native code COM DLL which you can use from almost any Windows language, including .NET, Delphi and Visual C++. You can create and manipulate PDFs, use images in various formats, extract text from existing documents, and more.
Working with Office documents
One advantage of Microsoft’s move from binary Office formats to XML is that the newer formats for Word, Excel and PowerPoint (docx, xlsx, and pptx) are more amenable to programmatic control. If you need to parse or generate Office documents, download the Open XML SDK 2.5 for .NET Framework 4.0, or the earlier SDK 2.0 if you need to run on .NET Framework 3.5.
The SDK includes a handy productivity tool that allows you to open an Open XML document, see its XML structure, validate it, and even reflect the code so that you can see the C# which would generate it. There is also a document compare feature, and reference documentation for the entire SDK. Furthermore Microsoft provides online tutorials with ‘How to’ help on questions like reading values from a spreadsheet, or how to apply a style to a paragraph in a Word document.
You can use the SDK in Visual Studio by adding a reference to the DocumentFormat.OpenXml assembly.
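Part of the reason the newer formats are easier to handle is that each file is a ZIP package of XML parts. The following sketch shows the idea in Python with the standard library only. It does not use the Open XML SDK, the function names are hypothetical, and the sample package is a stripped-down stand-in rather than a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(data):
    """Return the text of each paragraph in the main part of a .docx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_sample_package():
    """Build a demonstration package containing only word/document.xml.

    A real .docx also needs [Content_Types].xml and relationship parts,
    which are omitted here for brevity.
    """
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Invoice 42</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Total: </w:t></w:r><w:r><w:t>99.50</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    return buffer.getvalue()


if __name__ == "__main__":
    for line in docx_paragraphs(make_sample_package()):
        print(line)
```

Reading simple text this way is straightforward, but styles, tables, numbering and relationships between parts soon make a dedicated library the more practical choice.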
If you need to work with the binary formats, you probably want to use a third-party library such as Aspose.Words for Word or Aspose.Cells for Excel which handle both old and new Office document types. Another option is to automate the Office applications, although this approach is less robust and not suitable for server-side code.
The Aspose libraries also support the rival OpenDocument formats used by OpenOffice and LibreOffice, among others.
There are also open source libraries for OpenDocument, including the ODF Toolkit. Hosted by the Apache Software Foundation, this is primarily a Java project, though there is a .NET implementation called AODL which lets you create and manipulate text and spreadsheet documents. However, it is fair to say that Microsoft has done a better job supporting programmatic use of Open XML, at least if you are working on the .NET platform.
Working with images
Many programming platforms already include an API for generating images. In the .NET Framework, for example, you can use the System.Drawing.Bitmap class to save images in a range of formats including BMP, EMF, GIF, JPEG, PNG and TIFF. You can also easily draw on a bitmap using the Graphics class which encapsulates the Windows GDI+ graphics API.
However, there are still cases where you want the services of a third-party library. Aspose.Imaging, for .NET and Java, covers common image types. Aspose.OCR, also for .NET and Java, makes it easy to read text from images, including font and style information if you need it. You might use it in conjunction with the Aspose document libraries to convert scanned documents into Microsoft Office format, for example.
Another key vendor for imaging is Leadtools which has imaging SDKs for .NET, Windows native code, Windows Runtime, HTML5, iOS, OS X, Android and Linux. The areas covered include loading, saving and converting, processing, capture and scanning, image display, and printing.
Working with SharePoint
Microsoft’s content management platform SharePoint runs on .NET, and in principle you can use any server-side .NET library as part of a SharePoint application. There are also add-ons specifically for SharePoint that fully support file types which SharePoint does not support natively. Until the 2013 release, SharePoint did not properly recognise or index PDF documents, for example. On earlier versions, you need to add a search filter from Adobe for this to work.
Aspose has a range of SharePoint add-ons based on its document conversion and manipulation libraries. The Aspose components for SharePoint let users convert documents from one type to another or, with Aspose.PDF, export lists and wiki pages to PDF.