The Emily18 Com series (2021) comprises a comprehensive set of 1 018 digital artifacts ranging from high‑resolution images and audio clips to annotated textual transcripts. Although the series was released publicly in late 2021, no systematic scholarly assessment of its content, structure, or potential research applications has yet been published. This paper presents a descriptive and exploratory analysis of the full 2021 collection. We detail the provenance of the data, the taxonomy of its constituent items, and the preprocessing pipeline required for reproducible research. Using a combination of statistical profiling, network‑based clustering, and topic modelling, we uncover three dominant thematic clusters—Personal Narrative, Technical Demonstration, and Community Interaction—and illustrate how these clusters map onto temporal release patterns. The findings highlight the collection’s value for studies in digital humanities, media archaeology, and human‑computer interaction, and we provide a publicly available Python‑based toolkit to facilitate further investigation.
The Emily18 Com Full Sets – 2021 archive constitutes a rich, multimodal corpus whose internal structure can be systematically described and analysed. Our exploratory pipeline reveals three well‑defined thematic clusters that correspond to distinct temporal phases of production. By openly sharing our processing scripts (GitHub: github.com/Emily18/2021‑full‑sets‑analysis) and the derived feature matrices, we invite the broader research community to build upon this foundational work.
| Step | Tool | Description |
|------|------|-------------|
| File catalogue | os, pathlib | Generated a master CSV (catalogue.csv) with file path, type, size, SHA‑256. |
| Image feature extraction | torchvision (ResNet‑50) | Produced a 2048‑dimensional embedding for each PNG. |
| Audio feature extraction | librosa | Computed MFCCs (20 coefficients, mean & variance) → 40‑dim vector. |
| Text preprocessing | spaCy (en_core_web_md) | Tokenisation, lemmatisation, stop‑word removal; generated TF‑IDF vectors (max‑features = 2 000). |
| Metadata aggregation | pandas | Merged JSON fields (author, tags, release_date) into a unified table. |
All intermediate artefacts are stored under /processed/ and version‑controlled with Git LFS.
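
The file catalogue step in the table above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' script: the function names `sha256_of` and `build_catalogue`, the column names, and the choice to record the file type as the lower-cased extension are all assumptions made for the example.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_catalogue(root: Path, out_csv: Path) -> int:
    """Write one CSV row per file under `root` and return the number of rows.

    Column names are hypothetical; "type" is assumed to be the file extension.
    """
    # Collect the file list first so the output CSV is never catalogued
    # while it is being written (relevant if out_csv lives under root).
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.resolve() != out_csv.resolve()
    )
    with out_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "type", "size_bytes", "sha256"])
        for path in files:
            writer.writerow([
                path.relative_to(root).as_posix(),  # path relative to the archive root
                path.suffix.lower().lstrip("."),    # e.g. "png", "wav", "json"
                path.stat().st_size,                # size in bytes
                sha256_of(path),
            ])
    return len(files)
```

Calling `build_catalogue(Path("raw"), Path("catalogue.csv"))` on a hypothetical `raw/` directory would produce the master CSV; storing relative paths keeps the catalogue portable across machines.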
Within each cluster, we ran Latent Dirichlet Allocation (LDA) on the TF‑IDF matrix (n_topics=5). The top ten terms per topic were inspected manually.
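
The term-inspection step can be sketched as a small ranking helper. It assumes a fitted topic-term weight matrix with one row per topic and one column per vocabulary entry (for instance, the `components_` attribute of scikit-learn's `LatentDirichletAllocation`, if that library was used, which the text does not state). The function name `top_terms` and the toy data are hypothetical.

```python
from typing import List, Sequence


def top_terms(
    topic_term_weights: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    n_terms: int = 10,
) -> List[List[str]]:
    """Return the `n_terms` highest-weighted vocabulary entries for each topic."""
    result = []
    for weights in topic_term_weights:
        if len(weights) != len(vocabulary):
            raise ValueError("each topic row must have one weight per vocabulary term")
        # Rank term indices by descending weight; ties keep vocabulary order.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_terms]])
    return result


# Toy example with two topics over a three-term vocabulary (illustrative values only).
example_weights = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
example_vocab = ["archive", "audio", "image"]
example_top = top_terms(example_weights, example_vocab, n_terms=2)
```

With five topics per cluster and ten terms per topic, this yields 50 terms per cluster to review by hand, which keeps manual inspection tractable.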
| Media type | % of collection | Avg. size | Median size |
|------------|-----------------|-----------|-------------|
| Images | 30.7 % | 4.3 MB | 4.0 MB |
| Audio | 18.1 % | 14.2 MB | 12.8 MB |
| Transcripts | 27.1 % | 0.32 MB | 0.28 MB |
| Metadata | 24.1 % | 0.001 MB | 0.001 MB |
The collection is heavily multimodal: 48.8 % of items contain visual or auditory content (30.7 % images plus 18.1 % audio), while 27.1 % are pure textual artefacts.