Off-platform lift · DataCatalog cross-listing · v1.0.0

Hugging Face Datasets submission

The 159-row indie SaaS teardowns dataset ships at /dataset with full Dataset JSON-LD, citation, BibTeX, and per-table CSVs. This page is the canonical handoff surface for mirroring it to Hugging Face Datasets – one of the recognised catalogs Google Dataset Search ranks favourably, and a second acquisition surface for the same corpus.

Five-step submission

  1. Create the HF dataset repo at huggingface.co/new-dataset. Suggested owner+name: unlocksaas/indie-saas-teardowns. Visibility: Public. License: cc-by-4.0.
  2. Download the canonical dataset card.
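    curl -OJ https://unlocksaas.com/dataset/huggingface/raw

    The response sets Content-Disposition to filename="README.md"; the -J flag tells curl to honour that header, so the file lands as README.md, ready to upload as-is. (With -O alone, curl would name the file raw after the URL path.)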

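  3. Upload the CSVs to the repo root. Five per-table CSVs, downloadable from /dataset: funnel-teardowns.csv, pricing-teardowns.csv, comparisons.csv, alternatives.csv, and categories.csv (the same paths the dataset card's configs block references).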

    HF Datasets Server auto-converts each CSV to Parquet on push – no manual conversion step.

  4. Set the activation env var on Vercel.
    vercel env add NEXT_PUBLIC_UNLOCKSAAS_HUGGINGFACE_DATASET_URL production
    # Paste the HF repo URL when prompted, for example:
    #   https://huggingface.co/datasets/unlocksaas/indie-saas-teardowns

    Repeat for the preview environment if you want the cross-listing visible on preview deploys too.

  5. Redeploy and verify. The Dataset JSON-LD on /dataset now declares the HF DataCatalog cross-listing automatically. Verify with the Google Rich Results Test against /dataset – the schema graph should show one includedInDataCatalog entry with name: "Hugging Face Datasets". Google Dataset Search re-ingests on the next crawl (typically 1–7 days).
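
Steps 2 and 3 can also be scripted. The following is a minimal sketch, not the project's own tooling: it assumes the huggingface_hub Python package is installed and already authenticated, that the repo uses the suggested name from step 1, and that README.md and the five CSVs sit in one local folder. The helper names missing_files and upload_all are hypothetical.

```python
from pathlib import Path

# File names taken from the `configs` block of the dataset card.
EXPECTED_CSVS = [
    "funnel-teardowns.csv",
    "pricing-teardowns.csv",
    "comparisons.csv",
    "alternatives.csv",
    "categories.csv",
]
ALL_FILES = EXPECTED_CSVS + ["README.md"]


def missing_files(folder, names=ALL_FILES):
    """Return the expected file names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in names if not (root / name).is_file()]


def upload_all(folder, repo_id="unlocksaas/indie-saas-teardowns"):
    """Upload README.md and the CSVs to the repo root (needs network + HF token)."""
    # Imported lazily so the pre-flight check works without the library installed.
    from huggingface_hub import HfApi

    missing = missing_files(folder)
    if missing:
        raise FileNotFoundError(f"missing before upload: {missing}")

    api = HfApi()
    for name in ALL_FILES:
        api.upload_file(
            path_or_fileobj=str(Path(folder) / name),
            path_in_repo=name,  # repo root, matching the card's `path:` values
            repo_id=repo_id,
            repo_type="dataset",
        )
```

Running missing_files("./export") before upload_all("./export") catches a forgotten CSV locally, before the HF viewer reports a config whose path does not resolve.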

YAML frontmatter preview

This is the exact YAML block the HF Hub parses to populate its search facets, drive the auto-Parquet conversion, and fill the dataset-card metadata panel. It is identical to what /dataset/huggingface/raw serves between its --- delimiters.

---
pretty_name: Indie SaaS Teardowns Dataset
language:
  - en
license: "cc-by-4.0"
license_link: "https://creativecommons.org/licenses/by/4.0/"
size_categories:
  - n<1K
task_categories:
  - "text-classification"
  - "text-retrieval"
tags:
  - saas
  - "indie-hackers"
  - marketing
  - "funnel-analysis"
  - "pricing-analysis"
  - comparison
  - "russell-brunson"
  - "value-ladder"
  - editorial
  - "honest-claims"
source_datasets:
  - original
doi: "10.5281/zenodo.20315741"
configs:
  - config_name: funnel_teardowns
    data_files:
      - split: train
        path: "funnel-teardowns.csv"
  - config_name: pricing_teardowns
    data_files:
      - split: train
        path: "pricing-teardowns.csv"
  - config_name: comparisons
    data_files:
      - split: train
        path: comparisons.csv
  - config_name: alternatives
    data_files:
      - split: train
        path: alternatives.csv
  - config_name: categories
    data_files:
      - split: train
        path: categories.csv
---
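
To sanity-check a downloaded README.md against the block above, the front matter can be split off and its config-to-file mapping listed. This is a rough sketch using only the standard library; config_paths is a naive line scan written for this card's layout, not a general YAML parser, and both function names are hypothetical.

```python
def split_front_matter(card_text):
    """Return (yaml_block, body) for a card that starts with a '---' line."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", card_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", card_text  # no closing delimiter: treat everything as body


def config_paths(yaml_block):
    """Map each config_name to its CSV path (line scan, not a YAML parser)."""
    mapping, current = {}, None
    for raw in yaml_block.splitlines():
        # Drop indentation and a leading list dash, e.g. "- config_name: x".
        line = raw.strip().lstrip("- ").strip()
        if line.startswith("config_name:"):
            current = line.split(":", 1)[1].strip().strip('"')
        elif line.startswith("path:") and current is not None:
            mapping[current] = line.split(":", 1)[1].strip().strip('"')
    return mapping
```

Against the card above, config_paths should yield five entries, one per CSV. Once the mirror is live, and assuming it is published under the suggested repo name, each config_name would be the second argument to datasets.load_dataset, for example load_dataset("unlocksaas/indie-saas-teardowns", "funnel_teardowns").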

Google Dataset Search verification

Google Dataset Search ingests the schema.org Dataset JSON-LD rendered on /dataset – there is no separate submission portal. Three things make a Dataset eligible for inclusion:

  • The page is indexable and the canonical URL is stable. Confirmed – /dataset ships robots: index, follow and is listed in /sitemap.xml.
  • The Dataset JSON-LD carries name, description, license, creator, distribution, dateModified, keywords, variableMeasured, and identifier. Confirmed – every field is sourced from the canonical dataset module, and the schema also includes measurementTechnique describing the editorial method.
  • Cross-catalog corroboration via includedInDataCatalog. Activates automatically once the HF env var lands. Until then the Dataset schema lists the canonical landing as its only sameAs – still eligible, but without the catalog-cross-listing rank lift.

Use the Google Dataset Search UI to query the canonical name once Google has crawled the updated schema. Typical end-to-end propagation after the HF env var is set: 24 hours for the schema, 1–7 days for Dataset Search re-ingestion.
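
Before waiting on a crawl, the cross-listing can be checked locally by parsing the JSON-LD out of the rendered page. The sketch below is one possible check, not a substitute for the Rich Results Test; it assumes the Dataset node carries a plain string @type and may sit inside an @graph array. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def _walk(node):
    """Yield every dict nested anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def catalog_names(html):
    """Return the names of all includedInDataCatalog entries on Dataset nodes."""
    collector = JsonLdCollector()
    collector.feed(html)
    names = []
    for block in collector.blocks:
        for node in _walk(json.loads(block)):
            if node.get("@type") != "Dataset":
                continue
            catalogs = node.get("includedInDataCatalog", [])
            if isinstance(catalogs, dict):  # a single entry may not be wrapped in a list
                catalogs = [catalogs]
            names.extend(c.get("name") for c in catalogs if isinstance(c, dict))
    return names
```

Fed the HTML of /dataset (for instance via urllib.request.urlopen), the function should return a list containing "Hugging Face Datasets" after the env var is set and the site redeployed, and an empty list before.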

Brunson Hard-Rule

The HF dataset card body and YAML frontmatter both derive from the same module that drives /dataset and its JSON-LD. The HF mirror cannot drift from the canonical site by construction – every row count, license string, citation, and column contract is read once at module load.

The HF cross-listing itself is operator-gated. The Dataset JSON-LD declares the catalog only when the env var resolves to a valid https:// URL. A missing or malformed value is silently skipped – the schema validator never sees a fabricated catalog claim.
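
The gating rule can be illustrated with a short sketch. This is not the site's implementation (which is not shown here) but a Python rendering of the described behaviour: emit the catalog node only for a well-formed https:// URL, otherwise emit nothing. The function names and the url key on the DataCatalog node are assumptions; the env var name and catalog name come from this page.

```python
import os
from urllib.parse import urlparse

ENV_VAR = "NEXT_PUBLIC_UNLOCKSAAS_HUGGINGFACE_DATASET_URL"


def catalog_entry(raw_value):
    """Return a DataCatalog node for a valid https:// URL, else None."""
    if not raw_value:
        return None
    value = raw_value.strip()
    try:
        parsed = urlparse(value)
    except ValueError:  # e.g. a malformed IPv6 host
        return None
    if parsed.scheme != "https" or not parsed.netloc:
        return None
    # The "url" key is an assumption about the emitted node's shape.
    return {"@type": "DataCatalog", "name": "Hugging Face Datasets", "url": value}


def dataset_jsonld(base, env=os.environ):
    """Copy `base` and attach includedInDataCatalog only when the env var is valid."""
    node = dict(base)
    entry = catalog_entry(env.get(ENV_VAR))
    if entry is not None:
        node["includedInDataCatalog"] = entry
    return node
```

Skipping silently rather than raising means a typo in the env var costs only the cross-listing and never breaks the page or produces an invalid catalog claim.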

Author: Maryan, Founder, Unlock SaaS

Last verified: 2026-05-18

Next editorial review: 2026-08-16

Raw README.md: /dataset/huggingface/raw