
Collections Accessibility

By Francesco Ramigni (Manager, Software Development, ACMI, Australia’s Museum of Screen Culture)

Libraries, archives and museums (LAM) have always found it challenging to improve the accessibility of their collections. With physical objects, institutions lack the space to put more than a fraction of their holdings on public display. As digitisation technology became more affordable and the Internet widespread, there was hope that the accessibility issue would be resolved once and for all. Unfortunately, to quote Eryk Salvaggio, we have now moved from the Age of Information to the Age of Noise.1 Text, sounds, images and videos are lost in a gigantic sea of data, and if we are lucky enough to find them, that information is almost immediately lost again, overwhelmed by the incessant stream of new digital input.

The incredibly rapid progress of neural networks and machine learning has now made it possible to first collect and scan, and then tag, classify, categorise and cluster not only textual documentation, but also images (and consequently videos, which are sequences of frames).

The newest AI tools (in particular diffusion models), some of them even open sourced, seem able to recognise objects in images with an accuracy that makes them worth applying in the LAM sector. What’s more, once combined with large language models, they can even describe whole scenes, actions, historical or environmental situations, or photographic and artistic styles.

The “machine” can now help LAM institutions with cataloguing tasks that have become humanly impossible. Applications have already been successfully developed by some archives and museums, such as ACMI, Australia’s national museum of screen culture, and the National Film and Sound Archive (NFSA).

If embeddings and training on data sets (of both words and pixels) allow us to retrieve images and videos much more efficiently, generating them with AI tools is the next immediate step. If the machine can “understand” what humans see in every single image, it can create a new image once we tell it what we want to see.

Figure 1: Australian Museum Collection web access (https://biocache.ala.org.au/occurrence/search?q=institution_uid:in4#tab_recordsView)

Because it cannot draw, it reverses the image-reading process, morphing together data from the billions of images used for its training. The machine can do this in such a sophisticated way that text-to-video generative AI triggers a “wow” reaction every time we watch its “creations.”

This may generate another legitimate need for library and museum visitors: in the future they may expect to access a visual representation of the objects in archives and collections, instead of just scrolling through dense lists of database records (see figure 1). So would it be worth generating and publishing an image or, better yet, a short video clip to describe and contextualise every item in a collection? How much would that benefit the visitors of the institution, and the cultural sector more broadly?

While institutions will never be able to afford the curators’ and media producers’ time needed to visually augment every collection record, generative AI may be able to do the job automatically and quickly, provided the output makes sense and is properly contextualised.

This will not happen if we just run an application that feeds a large language model with the record metadata. The challenge is very similar to one already observed in LLM chats: relying only on the general training data sets and the inputted prompt, the model lacks reasoning ability and its answers can be polluted with hallucinations, digressions and inconsistencies.

Many are trying to solve the issue with LLM augmentation, usually fine-tuning or RAG (retrieval-augmented generation), but more techniques, or combinations of them, are surfacing every week.

We might then want to apply the same contextualising techniques to record-to-video generation.

ACMI, Australia’s museum of screen culture in Melbourne, is an interesting case study.

It has a vast collection of items (donated, acquired or produced in house) covering a variety of screen culture genres: from home videos to videogames, from projection equipment to TV ads, from feature films to historical documentaries. Its main challenge is to ensure online accessibility to those objects for an audience of ACMI visitors, researchers and, in general, screen culture aficionados.

ACMI has been quite innovative in exploring and implementing AI technology to improve the searchability of its collections. For instance, it has been audio captioning all published videos using Whisper2 and then search-indexing the text for quick retrieval. It has also augmented the video search feature with an object-detection tool, BLIP-2, coupled with a large language model for captioning frames.3 Finally, it is vectorising all collection metadata and using the resulting embeddings to discover unexpected connections among collection items.4
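As a rough illustration of the first two of those steps (not ACMI’s actual code; the model sizes and file names are placeholders), transcribing a video’s audio with Whisper and captioning an extracted frame with BLIP-2 could look something like this in Python:

```python
import whisper
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# 1. Transcribe the audio track of a collection video with Whisper
asr_model = whisper.load_model("base")  # placeholder model size
transcript = asr_model.transcribe("collection_video.mp4")["text"]

# 2. Caption a single frame extracted from the same video with BLIP-2
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
frame = Image.open("frame_0001.jpg")
inputs = processor(images=frame, return_tensors="pt")
caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

# Both the transcript and the frame caption can then be search-indexed
print(transcript[:200])
print(caption)
```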

However, the ACMI Collection also includes undigitised video material. These items are still retrievable via the website, but they are not accompanied by any images, trailers, teasers or other kind of visual description.

I have chosen a couple of examples because I was interested to see whether a free generative AI tool, with all default settings and prompted only with the metadata associated with the records, could give us a glimpse of the potential of this technology for visualising collection items with minimum effort.

Later, I will examine how we can contextualise those examples, augmenting the generative AI processing.

The first item is retrievable from the ACMI website5 and it’s a 1937 U.S. educational film about the building construction craft and industry. Only a descriptive text and some metadata are published, and there is a note saying that, unfortunately, ACMI doesn’t have any image or video for this film (which is 16mm, not digitised).

For my experiment, I chose the Runway Gen-2 multimodal system,6 because of its popularity, friendly interface and processing speed.

As a first step, I generated a few images, prompting only with the content of the ACMI website item page. As everyone knows, “a picture is worth a thousand words,” and this is true also for an AI-generated image. But the image must be spot on; it must really summarise everything in one shot: what the object is about, why it is there, how it got there, how it is connected, and how it affects our cultures and environments. This is a very complicated task for an AI that can only reconstruct an image from clusters of pixels associated with words or groups of words.
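For readers who want to automate that first step, a minimal sketch follows: it scrapes the title and description from the public item page and assembles a prompt. The CSS selectors and the commented-out generate_image() call are assumptions for illustration; in the experiment described here, the prompt was entered directly in the Runway interface.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the public ACMI collection page for the item
page = requests.get("https://www.acmi.net.au/works/72759--shelter-usa/", timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

# The selectors below are assumptions about the page markup, for illustration only
title = soup.find("h1").get_text(strip=True)
description = " ".join(p.get_text(strip=True) for p in soup.find_all("p")[:3])

prompt = f"{title}. {description}"
print(prompt)

# image = generate_image(prompt)  # hypothetical call to a text-to-image service
```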

I have added below one of the most interesting images, in which I find some appealing details (the black-and-white grain of a 1930s film, the row of dwellings referencing the construction industry, the tracks suggesting the role of transportation). However, there are clearly some distortions, if not outright hallucinations, and a general feeling of a lack of context.

Figure 2: Generated by Gen 2 – Runway, prompting the content of https://www.acmi.net.au/works/72759--shelter-usa/

I used the same prompt to generate a very short video on the same Runway Gen-2 platform. A video may be more entertaining and attractive than a single image (and more suitable for a “film” item). A video, which can also be associated with audio and captions, can narrate a “story.” Because it is a correlated sequence of a much greater number of images, it can describe the object from multiple perspectives: its history, location, usage, technical details and relation to human society, expanding even to visit or access information. A lot can be communicated in a 10-second video clip.

In this case, however, the video example I generated carries the same flaws we noticed in the image: the historical period is definitely there, a house and a car as well, but the meaning of the scene remains unclear.

Figure 3: Generated by Gen 2 – Runway, prompting the content of https://www.acmi.net.au/works/72759--shelter-usa/

It is exciting to note that, for both images and videos, even with the limits of an uncontextualised prompt, there are already elements showing the potential of AI-generated visualisation, without the need for any manual post-production editing.

The second example is another U.S. educational film,7 this time about immigration. Again I generated a few initial images (an example below) and then a short video.

Figure 4: Generated by Gen 2 – Runway, prompting the content of https://www.acmi.net.au/works/68784--immigration/
Figure 5: Generated by Gen 2 – Runway, prompting the content of https://www.acmi.net.au/works/68784--immigration/

There are some positive clues here: the post-war atmosphere, people’s clothes, crowds gathering, queuing or walking single file to work, even a glimpse of the film colouring of that time. It means there are references, although vague, to year, location, topic and media type.

But again, looking at the video, apart from the typical text-to-video AI distortions, the impression is that many frames are repetitive and consequently wasted. Also, because it is a single uninterrupted cinema-like scene, it might mislead users into thinking it is an excerpt from the film itself.

Considering that the prompting I used was very basic (some might even say “brutal”), there is certainly room to improve the generated output with some additional contextualising information.

The first step may be the injection of a “system message.” In the case of the ACMI museum, this could be something like:

“I am a film museum visitor. I am exploring the museum’s collection and I have found the film <title>, made in the year <year>, in <country>, with this description: <description>. Make a very short video that describes this collection item, in the cultural context of its time and location.”

This system message can be coded and automatically applied to all films in the ACMI Collection.
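A minimal sketch of how such a system message could be applied programmatically to every film record is shown below; the field names are assumptions about the collection metadata, not ACMI’s actual schema.

```python
# Hypothetical record; the field names are illustrative, not ACMI's actual schema
record = {
    "title": "Shelter (USA)",
    "year": 1937,
    "country": "United States",
    "description": "An educational film about the building construction craft and industry.",
}

SYSTEM_MESSAGE = (
    "I am a film museum visitor. I am exploring the museum's collection and I have "
    "found the film {title}, made in the year {year}, in {country}, with this "
    "description: {description}. Make a very short video that describes this "
    "collection item, in the cultural context of its time and location."
)

def build_prompt(record: dict) -> str:
    """Fill the system message template with the metadata of one collection record."""
    return SYSTEM_MESSAGE.format(**record)

print(build_prompt(record))
```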

The output may still be too generic, and we might want to give extra context, more specific to the institution where the item is preserved and catalogued: for instance, documents describing the strategic vision of the museum, the kinds of exhibitions it hosts, the audiences who visit it, the researchers who browse its collections, and perhaps policies and regulatory guidelines for museum video production. All this documentation would be indexed and vectorised, then queried by a retrieval engine that in turn augments the prompt and improves the AI generator’s grounding and reasoning (as in a RAG system8).
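A minimal sketch of that retrieval step follows, assuming a generic sentence-embedding model (here the sentence-transformers library, not necessarily what an institution would use); the “documents” are invented snippets standing in for real institutional documentation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Invented snippets standing in for real institutional documentation
documents = [
    "Strategic vision: connect audiences with the past, present and future of screen culture.",
    "Production guideline: archival black-and-white material is presented without colourisation.",
    "Audience profile: museum visitors, researchers and screen culture aficionados.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vectors = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# Prepend the retrieved context to the record prompt before video generation
record_prompt = "Make a very short video about a 1937 US educational film on building construction."
augmented_prompt = "\n".join(retrieve(record_prompt)) + "\n\n" + record_prompt
print(augmented_prompt)
```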

We could also apply some additional training if we already have numerous collection items equipped with proprietary, curated images or thumbnails. This would be the most resource-intensive augmentation work, given the amount of labour and GPU hardware required for fine-tuning any kind of generative AI; the cost would depend on the scale of the training data sets, the level of parametrisation, the number of iterations, and so on.

Finally, we could enforce some video editing with specific additional instructions: for example, “add an initial 5 seconds with a description of the entire collection” or “add a 5-second tail with the museum’s visiting hours (or a list of the current exhibitions).”
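If the generation platform cannot add such segments itself, the intro and tail clips could be spliced on afterwards. A minimal sketch using ffmpeg’s concat demuxer is below; the file names are placeholders, and the clips are assumed to share codec, resolution and frame rate.

```python
import subprocess
import tempfile
from pathlib import Path

def add_intro_and_tail(intro: str, generated: str, tail: str, output: str) -> None:
    """Concatenate a fixed intro clip, the AI-generated clip and a fixed tail clip."""
    # Write the file list that ffmpeg's concat demuxer expects
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for clip in (intro, generated, tail):
            f.write(f"file '{Path(clip).resolve()}'\n")
        list_file = f.name
    # Stream copy: assumes all clips share codec, resolution and frame rate
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output],
        check=True,
    )

# Placeholder file names: a 5-second collection intro and a 5-second visiting-hours tail
add_intro_and_tail("collection_intro.mp4", "generated_item.mp4", "visiting_hours_tail.mp4", "published_item.mp4")
```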

Once set up, the platform would automatically process all records on ingest, potentially thousands or millions of them.
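A bare-bones sketch of such an ingest loop is shown below; every helper is a stub standing in for the components described above (retrieval, prompt building, video generation, quality assurance, publication), so the structure, not the implementation, is the point.

```python
# All helpers are stubs; they stand in for the components discussed in this article.

def retrieve_institutional_context(record: dict) -> str:
    return "Museum vision, audience profile and production guidelines would be retrieved here."

def compose_prompt(record: dict, context: str) -> str:
    return f"{context}\n\nMake a very short video about '{record['title']}' ({record['year']})."

def generate_video(prompt: str) -> bytes:
    return b"video-bytes"  # placeholder for a call to a text-to-video platform

def passes_quality_checks(video: bytes, record: dict) -> bool:
    return len(video) > 0  # placeholder for hallucination, bias and policy checks

def publish(video: bytes, record: dict) -> None:
    print(f"Published generated video for {record['title']}")

def flag_for_human_review(video: bytes, record: dict) -> None:
    print(f"Flagged {record['title']} for human review")

def on_ingest(records: list[dict]) -> None:
    """Run the whole record-to-video pipeline over newly ingested collection records."""
    for record in records:
        context = retrieve_institutional_context(record)
        prompt = compose_prompt(record, context)
        video = generate_video(prompt)
        if passes_quality_checks(video, record):
            publish(video, record)
        else:
            flag_for_human_review(video, record)

on_ingest([{"title": "Shelter (USA)", "year": 1937}])
```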

Below you can see a rough diagram of what the solution in its entirety would be:

This augmented text-to-video generation would not be exempt from the well-known challenges of adopting AI. Cultural institutions, especially in the GLAM sector, are sometimes questioned over the choices made by their curators. The issue of public trust becomes more complex if the categorisation is decided by a machine, trained on partial, limited (and sometimes illicit or morally unacceptable) data sets, with calculations made by a pre-determined, imperfect algorithm. Copyright infringement, bias and unethical content are just some aspects of a more fundamental problem, and they are even more evident when manipulating images and videos.

These challenges would exist in this theoretical record-to-video generation as well. There is no guarantee that the video would not contain frames that are questionable at the least, if not totally inappropriate, and not just by ethical and cultural standards but also through misalignment with the institution’s mission and values.

That would force institutions to adopt mitigation actions, which may include:

• Using generative AI platforms trained exclusively on known, approved and authorised data sets

• Augmenting the prompting with safeguards against biases, ethical breaches, confrontational themes or unwanted topics

• Implementing automatic quality assurance tools, preventing the publication of AI hallucinations or digressions

We can see how many moving parts are still necessary to build a reliable, consistent and safe generative AI application, which makes operationalising these kinds of solutions even more challenging. Once the experimentation phase is concluded, and a prototype has been evaluated, tested and finally approved, there are numerous issues that operations managers need to tackle. Some of them may include:

• Planning a continuous evaluation of the generative AI platforms, adopting the best available tool within the assigned budget;

• Defining image and video parameters based on consumption scenarios (e.g., output devices);

• Ensuring that all the documentation, data sets, templates, etc., are up to date and secure;

• Monitoring and evaluating performance and output quality (based on agreed metrics);

• Planning and implementing corrective actions in case of complaints or incidents.

It is often said in the LAM sector that operationalising AI applications is the most difficult part and, indeed, so far cultural institutions have only rarely made their experiments live and publicly accessible.

With the enormous investment in AI research, we don’t know what we’ll see over the next few months. There is vast potential for AI video generation to improve in every respect, and progress in sustainability and compliance with ethical principles will determine its adoption in the cultural sector.

Endnotes

1. Eryk Salvaggio, The Age of Noise, https://cyberneticforests.substack.com/p/the-age-of-noise.

2. Simon Loffler, Collection Video Transcriptions at Scale with Whisper, https://labs.acmi.net.au/collection-video-transcriptions-at-scale-with-whisper-5e8df10467f8.

3. Simon Loffler, Seeing Inside ACMI’s Collection – Part 2, https://labs.acmi.net.au/video-image-search-94bed300ca22.

4. Simon Loffler, Embeddings and Our Collection, https://labs.acmi.net.au/embeddings-and-our-collection-afa815b8406e.

5. ACMI Collection, Shelter (USA), https://www.acmi.net.au/works/72759--shelter-usa/.

6. Runway Research, Gen-2: The Next Step Forward for Generative AI, https://research.runwayml.com/gen2.

7. ACMI Collection, Immigration, https://www.acmi.net.au/works/68784--immigration/.

8. Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://arxiv.org/abs/2005.11401.
