Collection Page Metadata

This guide shows you how to extract metadata from a single archived collection page, including its title, description, image details, and more.

Written by Bjorn Forsberg

This guide shows how to extract any available metadata from a single archived collection page, including:

  • title

  • description

  • original collection URL

  • image URL

  • image alt text

  • collection handle

  • archived page URL

You do not need any coding experience. You only need to open the archived collection page, paste a script into your browser console, and copy the result.


Before you start

Make sure you already have the archived collection page open in your browser.

This works best on pages opened from the Wayback Machine or another archived version of a collection page. See this article if you have not generated the collection list file yet.

Step 1: Open the archived collection page

Open the archived collection page in your browser. You should be on the exact page you want to extract data from.

Step 2: Open your browser console

Choose the instructions for your browser.

Chrome

  • Windows: press Ctrl + Shift + J

  • Mac: press Cmd + Option + J

Safari

  • First enable the Develop menu in Safari settings if needed

  • Then press Cmd + Option + C

Firefox

  • Windows: press Ctrl + Shift + K

  • Mac: press Cmd + Option + K

A panel will open, usually at the bottom or side of the browser window.

Step 3: Paste the script into the console

Copy the full script below.

Note: A description of what the script checks can be found further down in this guide.

(() => {
  // Column order for the CSV output.
  const headers = [
    'title',
    'description',
    'url',
    'image src url',
    'image alt text',
    'handle',
    'url to the archived version of the collection page'
  ];

  // Read the content attribute of a meta tag, or '' if it is missing.
  const getMeta = (selector) =>
    document.querySelector(selector)?.getAttribute('content')?.trim() || '';

  // Resolve a possibly relative URL against the current page.
  const getAbsUrl = (value) => {
    if (!value) return '';
    try {
      return new URL(value, location.href).href;
    } catch {
      return value;
    }
  };

  // Collect every JSON-LD node on the page, flattening arrays and @graph.
  const getJsonLdNodes = () => {
    return [...document.querySelectorAll('script[type="application/ld+json"]')]
      .flatMap((el) => {
        try {
          const json = JSON.parse(el.textContent.trim());

          const flatten = (obj) => {
            if (!obj) return [];
            if (Array.isArray(obj)) return obj.flatMap(flatten);
            if (obj['@graph']) return flatten(obj['@graph']);
            return [obj];
          };

          return flatten(json);
        } catch {
          return [];
        }
      });
  };

  const nodes = getJsonLdNodes();

  // Prefer a node whose URL points at a collection page, then any
  // CollectionPage/WebPage node, then fall back to an empty object.
  const bestNode =
    nodes.find((x) => String(x.url || x['@id'] || '').includes('/collections/')) ||
    nodes.find((x) => String(x['@type'] || '').match(/CollectionPage|WebPage/i)) ||
    {};

  const ldUrl = getAbsUrl(bestNode.url || bestNode['@id'] || '');
  const ogUrl = getAbsUrl(getMeta('meta[property="og:url"]'));
  const canonicalUrl = document.querySelector('link[rel="canonical"]')?.href || '';

  // JSON-LD images can be a string, an array, or an object with a url.
  const imageFromLd = (() => {
    const img = bestNode.image;

    if (typeof img === 'string') return img;
    if (Array.isArray(img) && typeof img[0] === 'string') return img[0];
    if (Array.isArray(img) && img[0]?.url) return img[0].url;
    if (img?.url) return img.url;

    return '';
  })();

  const imageAltFromLd = (() => {
    const img = bestNode.image;

    if (Array.isArray(img) && typeof img[0] === 'object') {
      return img[0].caption || img[0].name || img[0].description || '';
    }

    if (img && typeof img === 'object') {
      return img.caption || img.name || img.description || '';
    }

    return '';
  })();

  // For each field, try JSON-LD first, then Open Graph, Twitter,
  // and standard meta tags.
  const title =
    bestNode.name ||
    bestNode.headline ||
    getMeta('meta[property="og:title"]') ||
    getMeta('meta[name="twitter:title"]') ||
    getMeta('meta[name="title"]') ||
    document.title.trim() ||
    '';

  const description =
    bestNode.description ||
    getMeta('meta[property="og:description"]') ||
    getMeta('meta[name="twitter:description"]') ||
    getMeta('meta[name="description"]') ||
    '';

  const url =
    ldUrl ||
    ogUrl ||
    canonicalUrl ||
    location.href;

  const imageSrc =
    getAbsUrl(
      imageFromLd ||
      getMeta('meta[property="og:image:secure_url"]') ||
      getMeta('meta[property="og:image"]') ||
      getMeta('meta[name="twitter:image"]')
    );

  const imageAlt =
    imageAltFromLd ||
    getMeta('meta[property="og:image:alt"]') ||
    getMeta('meta[name="twitter:image:alt"]') ||
    '';

  // The handle is the last path segment of the collection URL.
  const handle = (() => {
    try {
      return new URL(url).pathname.split('/').filter(Boolean).pop() || '';
    } catch {
      return '';
    }
  })();

  const row = {
    'title': title,
    'description': description,
    'url': url,
    'image src url': imageSrc,
    'image alt text': imageAlt,
    'handle': handle,
    'url to the archived version of the collection page': location.href
  };

  // Quote every field and double any embedded quotes (standard CSV escaping).
  const csvField = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;

  const csv = [
    headers.join(','),
    headers.map((h) => csvField(row[h])).join(',')
  ].join('\n');

  console.log('Row object:', row);
  console.log(csv);

  return csv;
})();


Paste it into the console and press Enter.

Step 4: Copy the result

After running the script, you will see two outputs in the console:

1. A row object

This is a readable preview of the extracted data.

2. A CSV result

This is the line you need to copy. It will look something like this:

title,description,url,image src url,image alt text,handle,url to the archived version of the collection page
"Abstract","Abstract collection","https://www.example.com/collections/abstract","https://www.example.com/image.jpg","","abstract","https://web.archive.org/..."

Copy the CSV output.
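
If you are curious how the CSV line is assembled, here is a minimal sketch of the quoting idea (simplified from the full script): every field is wrapped in double quotes, and any quote character inside a value is doubled, which is the standard CSV escaping rule.

```javascript
// Quote a single CSV field: wrap in double quotes, double embedded quotes.
const csvField = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;

// Build a header line plus one data row, hypothetical example values.
const headers = ['title', 'handle'];
const row = { title: 'Abstract "Art", Vol. 1', handle: 'abstract' };

const csv = [
  headers.join(','),
  headers.map((h) => csvField(row[h])).join(','),
].join('\n');
// csv === 'title,handle\n"Abstract ""Art"", Vol. 1","abstract"'
```

Because every field is quoted, commas inside a title or description will not break the row apart when you paste it into a spreadsheet.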

Step 5: Paste into a spreadsheet

Open Excel or Google Sheets and paste the CSV output.

If you are collecting multiple pages, repeat the same process for each archived collection page and paste each new row underneath the previous one.

What the script checks

The script looks for metadata in several places on the page, then uses the best available value.

It checks:

  • structured data in application/ld+json

  • Open Graph tags like og:title, og:description, og:image

  • Twitter meta tags

  • standard meta description

  • canonical URL

  • the current archived page URL

This helps it return useful values even if some metadata is missing.
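
The fallback logic can be sketched in a few lines. This is a simplified illustration, not the script itself: each candidate source is tried in order, and the first non-empty string wins, which is what the script's chained `||` expressions do across JSON-LD, Open Graph, Twitter, and standard meta tags.

```javascript
// Return the first candidate that is a non-empty string, else ''.
const firstNonEmpty = (...values) =>
  values.find((v) => typeof v === 'string' && v.trim() !== '') || '';

// Hypothetical example: the JSON-LD name is missing (''), so the
// og:title value ('Abstract') is used instead of later fallbacks.
const title = firstNonEmpty('', 'Abstract', 'Fallback title');
// → 'Abstract'
```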

Troubleshooting

Nothing happens

Make sure you pasted the full script and pressed Enter.

The result is blank

Some archived pages do not include all metadata. The script will still return whatever it can find.

The URL looks like the archived page instead of the original page

That usually means the original collection URL was not available in the page metadata, so the script used the best fallback.
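
Even in that fallback case, the handle column is usually still correct. The Wayback Machine embeds the original URL inside its own path, so taking the last path segment recovers the handle either way. A small sketch (using a made-up example URL):

```javascript
// Extract the last non-empty path segment of a URL, as the script does.
const handleFrom = (url) =>
  new URL(url).pathname.split('/').filter(Boolean).pop() || '';

handleFrom('https://web.archive.org/web/2024/https://www.example.com/collections/abstract');
// → 'abstract'
```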

Tips

  • Run the script on the exact collection page, not on the collection list or homepage

  • Use the CSV output, not the preview object, when pasting into a spreadsheet

  • Keep the header row only once if you are combining results from many pages

Need to collect many pages?

If you are working through a large list of archived collection pages, here are the condensed steps from this guide:

  1. open a page

  2. run the script

  3. copy the CSV row

  4. paste it into your spreadsheet

  5. repeat for the next page
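
If you prefer to merge the copied rows outside the spreadsheet, a hypothetical helper like the one below (not part of the guide's script) keeps the header from the first result and appends only the data rows from the rest:

```javascript
// Merge several CSV results: keep the first header, drop repeated headers.
const combineCsv = (csvResults) => {
  const [first, ...rest] = csvResults;
  return [
    first,
    ...rest.map((csv) => csv.split('\n').slice(1).join('\n')),
  ].join('\n');
};

combineCsv(['h1,h2\n"a","b"', 'h1,h2\n"c","d"']);
// → 'h1,h2\n"a","b"\n"c","d"'
```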
