@ himajin
2025-05-29 18:19:31

Experiment Plan for Text Similarity Comparison Algorithms (Revised v3)
1. Introduction
1.1. Research Background and Objectives
This research aims to evaluate the performance of various algorithms for comparing the similarity between individual page texts extracted from a specific technical document (in this experiment, the content of the Tailwind CSS documentation site). Initially, we considered dividing the text into 250-word chunks. However, due to the abundance of Markdown and code in the target document, meaningful chunking proved to be cumbersome. Therefore, we decided to use the entire text extracted from each page as the unit of comparison.
This study will systematically compare and examine combinations of different "representation methods" and "comparison methods" from multiple perspectives: ease of implementation, processing speed, memory consumption, and accuracy of similarity judgment. A particular focus will be on elucidating the effectiveness of the information-theoretic NCD (Normalized Compression Distance) and of vector embedding-based methods, the latter planned for future evaluation.
1.2. Report Structure
This report will first describe the experimental data and its preprocessing methods. Next, it will define in detail the representation methods and comparison methods that form the axes of evaluation, and present specific experimental cases combining them. After presenting the results and discussion of initial experiments using NCD, it will describe the metrics for evaluating each experimental case, specific experimental procedures, and the expected outcomes and future prospects of this research.
2. Experimental Data
- Target Content: Individual HTML pages from the Tailwind CSS documentation site (tailwindcss.com).
- Data Unit: The entire text of each page, extracted from the HTML files with the `html2text` command and further processed with `sed` to remove control characters. This serves as the basic unit of comparison in this experiment (a short extraction sketch follows this list).
- Data Storage Location: The extracted and preprocessed text files are stored locally under the `./tailwindcss.com` directory, maintaining the original file structure.
- Language: English
- Example Search Query: A representative search query for this experiment is "Utilities for controlling how a background image behaves when scrolling." (Multiple queries and their expected similar pages may be used for more robust evaluation.)
- Example Expected Similar Page: For the query above, `/docs/background-attachment` is expected to be the most semantically similar page.
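The extraction step can be reproduced with a short script. Below is a minimal Python sketch of the `html2text` + `sed` pipeline described above; the `.txt` output naming and the control-character regex (standing in for the original `sed` invocation) are assumptions, not the user's exact commands.

```python
# Minimal sketch of the preprocessing described above. Assumes the
# `html2text` command-line tool is installed; the regex stands in for
# the original `sed` control-character cleanup.
import pathlib
import re
import subprocess

SRC_DIR = pathlib.Path("./tailwindcss.com")  # mirrored HTML tree

def extract_page_text(html_path: pathlib.Path) -> str:
    """Convert one HTML page to plain text and drop control characters."""
    result = subprocess.run(
        ["html2text", str(html_path)],
        capture_output=True, text=True, check=True,
    )
    # Remove ASCII control characters except newline and tab.
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", result.stdout)

if __name__ == "__main__":
    for path in SRC_DIR.rglob("*.html"):
        text = extract_page_text(path)
        path.with_suffix(".txt").write_text(text, encoding="utf-8")
```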
3. Experimental Design
This experiment is designed by dividing the process of evaluating text similarity into two main axes: "Representation Methods" and "Comparison Methods."
3.1. Representation Methods (Text Quantification/Vectorization)
- Naive (Raw Text / Full Page Text)
  - Method: Use the entire preprocessed text extracted from each document page as raw string data, without special transformations.
  - Objective: Serve as the direct input for information-based comparison methods like NCD and as a baseline for the more advanced representation methods to be evaluated later.
- (Future Experiment) Vector Embedding via Gemini API (Embedding-Gemini)
  - Method: Utilize Google's Gemini API (`models/text-embedding-004`) to convert the entire text of each page into high-dimensional dense vectors (embeddings). A hedged request sketch follows this list.
  - Objective: Evaluate the performance of context-rich vector representations generated by a state-of-the-art large language model.
- (Future Experiment) Vector Embedding via Local Lightweight Model (Embedding-MiniLM-GGUF)
  - Method: Run a GGUF-quantized version of the pre-trained `all-MiniLM-L6-v2` model (`all-MiniLM-L6-v2-Q5_K_M.gguf`) in a local environment to convert the entire text of each page into vector representations. The GGUF format offers benefits such as smaller model size and potentially faster CPU inference. A loading sketch also follows this list.
  - Objective: Evaluate the performance of a widely used open-source lightweight model (quantized version) in comparison to API-based large-scale models and domain-specific learned models.
- (Future Experiment) Extraction of Internal Feature Vectors via Mathematica (Embedding-MMA)
  - Method: Use the entire page texts from the target document set as input. Employ Mathematica's neural network framework to pass each page text through an embedding layer, apply L2 normalization to the resulting vectors, and then reduce dimensionality to approximately 100 dimensions with Principal Component Analysis (PCA). The final vector serves as the feature vector. This pipeline aims to create dense, normalized representations specific to the document corpus, with PCA capturing the most significant variance in a lower-dimensional space, potentially improving efficiency and reducing noise. Pairing a corpus-trained (or fine-tuned) embedding layer with PCA seeks to balance domain specificity with robust dimensionality reduction.
  - Objective: Evaluate the performance of vector representations processed or specialized for the target document set.
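As noted in the Embedding-Gemini item, here is a minimal request sketch using Python's `requests`. The `embedContent` endpoint and payload shape reflect the public REST API as understood here and should be verified against the current Gemini documentation; `GEMINI_API_KEY` is an assumed environment variable.

```python
# Hedged sketch of the Embedding-Gemini step; verify endpoint and payload
# against the current Gemini API documentation before use.
import os
import requests

API_KEY = os.environ["GEMINI_API_KEY"]  # assumed to be set beforehand
URL = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/text-embedding-004:embedContent")

def embed_gemini(page_text: str) -> list[float]:
    """Request one embedding vector for a page text."""
    payload = {
        "model": "models/text-embedding-004",
        "content": {"parts": [{"text": page_text}]},
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]["values"]
```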
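And for Embedding-MiniLM-GGUF, a loading sketch that assumes `llama-cpp-python` can run this BERF-family GGUF in embedding mode; `ctransformers` (named in Section 6) would be an alternative backend. The model filename is taken from the plan; everything else is an assumption.

```python
# Hedged sketch: load the quantized MiniLM model locally for embeddings.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="all-MiniLM-L6-v2-Q5_K_M.gguf",
    embedding=True,   # run in embedding mode rather than text generation
    verbose=False,
)

def embed_minilm(page_text: str) -> list[float]:
    """Return one pooled embedding vector for a page text."""
    result = llm.create_embedding(page_text)
    return result["data"][0]["embedding"]
```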
3.2. Comparison Methods (Distance/Similarity Calculation between Representations)
- NCD (Normalized Compression Distance)
  - Applicable to: Naive (Full Page Text)
  - Method: For two data objects `x` (query) and `y` (document page text), calculate `NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))`, where `C(s)` is the size (e.g., byte length) of data `s` after compression with a specific algorithm, and `C(xy)` is the compressed size of the concatenation of `x` and `y`. A value closer to 0 indicates higher similarity. A minimal implementation sketch follows this list.
  - Compression Algorithms to Compare: DEFLATE (gzip), bzip2, LZMA, XZ, Zstandard (zstd), LZO, Snappy, LZ4 (as used in the user-provided script).
  - Objective: Evaluate similarity from an information-theoretic perspective based on data commonality and redundancy, and compare the impact of different compression algorithms on NCD results.
- (Future Experiment) Cosine Similarity
  - Applicable to: Embedding-Gemini, Embedding-MiniLM-GGUF, Embedding-MMA
  - Method: Calculate the cosine of the angle between two vectors.
  - Objective: Standard similarity evaluation based on the directionality (semantic closeness) of vector representations.
- (Future Experiment) Euclidean Distance
  - Applicable to: Embedding-Gemini, Embedding-MiniLM-GGUF, Embedding-MMA
  - Method: Calculate the straight-line distance between two vectors in a multidimensional space.
  - Objective: Similarity evaluation based on the absolute positional relationship of vector representations.
- (Future Experiment) Manhattan Distance (L1 Distance)
  - Applicable to: Embedding-Gemini, Embedding-MiniLM-GGUF, Embedding-MMA
  - Method: Calculate the sum of the absolute differences of the vectors' Cartesian coordinates.
  - Objective: Similarity evaluation based on axis-aligned travel distance, differing from Euclidean distance.
- (Future Experiment) Mahalanobis Distance
  - Applicable to: Embedding-Gemini, Embedding-MiniLM-GGUF, Embedding-MMA
  - Method: Calculate the distance between two vectors taking the covariance of the data into account, yielding a metric that adjusts for scale differences and correlations between features (vector dimensions). A NumPy/SciPy sketch covering all four vector metrics follows this list.
  - Objective: More robust similarity evaluation that considers the structure (correlations) of the feature space.
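As referenced in the NCD item, here is a minimal sketch limited to compressors in Python's standard library (DEFLATE via `zlib`, bzip2 via `bz2`, legacy LZMA and XZ via `lzma`); zstd, LZO, Snappy, and LZ4 require external tools or bindings, as in the user-provided script. The sample file path is illustrative.

```python
# Minimal NCD sketch using standard-library compressors only.
import bz2
import lzma
import zlib

COMPRESSORS = {
    "deflate": lambda b: zlib.compress(b, 9),
    "bzip2": lambda b: bz2.compress(b, 9),
    "lzma": lambda b: lzma.compress(b, format=lzma.FORMAT_ALONE),  # legacy .lzma
    "xz": lambda b: lzma.compress(b, format=lzma.FORMAT_XZ),
}

def ncd(x: bytes, y: bytes, algo: str = "deflate") -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    compress = COMPRESSORS[algo]
    cx, cy, cxy = len(compress(x)), len(compress(y)), len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Example: score the query against one page text (path is illustrative).
query = b"Utilities for controlling how a background image behaves when scrolling."
page = open("tailwindcss.com/docs/background-attachment.txt", "rb").read()
print(ncd(query, page, "xz"))  # closer to 0 = more similar
```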
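And for the vector-based metrics, a NumPy/SciPy sketch covering all four comparison methods; the commented covariance step shows the per-embedding-type pre-computation the Mahalanobis item calls for.

```python
# Sketch of the four vector comparison methods from Section 3.2.
import numpy as np
from scipy.spatial import distance

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Higher is more similar (range [-1, 1]).
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.linalg.norm(u - v))

def manhattan_distance(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.sum(np.abs(u - v)))

def mahalanobis_distance(u: np.ndarray, v: np.ndarray,
                         inv_cov: np.ndarray) -> float:
    return float(distance.mahalanobis(u, v, inv_cov))

# The inverse covariance is estimated once per embedding type from the
# full (n_pages x dim) matrix of page vectors; pinv guards against a
# near-singular covariance matrix.
# inv_cov = np.linalg.pinv(np.cov(embeddings, rowvar=False))
```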
3.3. Experimental Cases (Initial NCD Experiments and Future Expansion)
3.3.1. Initial Experiments Conducted (NCD)
The following experimental cases were conducted using the user-provided script. The representation method was "Naive (Full Page Text)."
| No. | Representation Method | Comparison Method (Distance/Similarity Metric) | Notes |
| :-: | :--------------------- | :---------------------------------------------- | :--------------------------- |
| 1 | Naive (Full Page Text) | NCD (gzip/DEFLATE) | One of the baselines |
| 2 | Naive (Full Page Text) | NCD (bzip2) | Compression method comparison |
| 3 | Naive (Full Page Text) | NCD (lzma) | Compression method comparison |
| 4 | Naive (Full Page Text) | NCD (xz) | Compression method comparison |
| 5 | Naive (Full Page Text) | NCD (zstd) | Compression method comparison |
| 6 | Naive (Full Page Text) | NCD (lzop) | Compression method comparison |
| 7 | Naive (Full Page Text) | NCD (snappy) | Compression method comparison |
| 8 | Naive (Full Page Text) | NCD (lz4) | Compression method comparison |
3.3.2. Future Experimental Plan (Vector Embedding)
| No. | Representation Method | Comparison Method (Distance/Similarity Metric) | Notes |
| :--: | :--------------------- | :---------------------------------------------- | :-------------------------------------- |
| 9 | Embedding-Gemini | Cosine Similarity | Standard vector similarity evaluation |
| 10 | Embedding-Gemini | Euclidean Distance | Standard vector similarity evaluation |
| 11 | Embedding-Gemini | Manhattan Distance | Axis-aligned distance similarity eval. |
| 12 | Embedding-Gemini | Mahalanobis Distance | Distance considering feature structure |
| 13 | Embedding-MiniLM-GGUF | Cosine Similarity | Evaluation of local lightweight model |
| 14 | Embedding-MiniLM-GGUF | Euclidean Distance | Evaluation of local lightweight model |
| 15 | Embedding-MiniLM-GGUF | Manhattan Distance | Evaluation of local lightweight model |
| 16 | Embedding-MiniLM-GGUF | Mahalanobis Distance | Evaluation of local lightweight model |
| 17 | Embedding-MMA | Cosine Similarity | Eval. of domain-specific MMA model |
| 18 | Embedding-MMA | Euclidean Distance | Eval. of domain-specific MMA model |
| 19 | Embedding-MMA | Manhattan Distance | Eval. of domain-specific MMA model |
| 20 | Embedding-MMA | Mahalanobis Distance | Eval. of domain-specific MMA model |
4. Results and Discussion of Initial NCD Experiments (Based on User-Provided Information)
4.1. Execution Overview
The user employed the provided Python scripts (`main.py`, `comparison.py`) to calculate NCD between a search query and the entire text extracted from each HTML page in the `./tailwindcss.com` directory. `main.py` invoked `comparison.py` with various compression commands (`gzip`, `bzip2`, `lzma`, `xz`, `zstd`, `lzop`, `lz4`), and `comparison.py` then used the specified command-line compression tools to compute NCD scores and write the results to CSV files.

- Search Query: "Utilities for controlling how a background image behaves when scrolling."
- Expected Similar Page: `/docs/background-attachment`
4.2. Key Results
The pages judged most similar (lowest NCD score) to the query for each compression algorithm were as follows (based on user-provided sorted results):

- Zstandard (zstd): `./tailwindcss.com/docs/background-attachment` (Score: 0.973...)
- LZ4: `./tailwindcss.com/docs/background-attachment` (Score: 0.976...)
- XZ: `./tailwindcss.com/docs/background-origin` (Score: 0.946...)
- LZMA: `./tailwindcss.com/docs/background-origin` (Score: 0.966...)
- gzip (DEFLATE): `./tailwindcss.com/docs/scroll-behavior` (Score: 0.969...)
- LZO: `./tailwindcss.com/docs/scroll-behavior` (Score: 0.955...)
- bzip2: `./tailwindcss.com/docs/mask-clip` (Score: 0.958...)
4.3. Initial Discussion
- Variation in Results by Compression Algorithm: The document judged most similar to the query varies with the compression algorithm used, likely because each algorithm captures different types of redundancy and patterns within the text.
- Alignment with Expected Results: With Zstandard and LZ4, the expected page (`/docs/background-attachment`) was judged most similar, suggesting these algorithms captured the information-theoretic commonality between the query and the target document relatively effectively in this instance.
- Range of NCD Scores: The reported NCD scores were generally close to 1.0, likely because the search query is short relative to the full-page documents, so the query contributes little to the overall compressibility of the concatenation. Relative differences were still captured, however, enabling ranking.
- Validity of Full-Page Comparison: Full-page comparison was chosen because chunking content rich in Markdown and code is difficult. While this approach simplifies preprocessing, results may also be influenced by the overall page structure, including common headers and footers.

This initial experiment indicates that NCD can function as an indicator of text similarity and that the choice of compression algorithm is crucial.
5. Evaluation Metrics (Including Future Experiments)
- Accuracy of Similarity Scores:
  - Ground Truth Preparation: A small, diverse subset of page pairs (e.g., 50-100 pairs) will be selected. For each pair, at least two evaluators familiar with the Tailwind CSS documentation will independently assign a similarity score on a 5-point Likert scale (1 = not similar, 5 = very similar). Inter-evaluator reliability (e.g., Krippendorff's alpha) will be calculated, and disagreements will be resolved through discussion to produce a consensus ground truth dataset. If resources are constrained, a single-evaluator approach with clear, predefined criteria will be used, acknowledging this limitation. Alternatively, page pairs likely to be similar will be selected based on internal references or chapter structure within the document.
  - Evaluation Metrics: Ranking evaluation (Precision@k, Recall@k, MAP: Mean Average Precision), correlation analysis (Spearman's rank correlation coefficient against human judgments), and classification evaluation (AUC-ROC, F1-score, assuming appropriate thresholding). A small implementation sketch of the ranking metrics follows this list.
- Processing Speed:
  - Average time to calculate similarity for a page pair, total calculation time over all page pairs (or a large sampled set), and representation generation time (API call time, local model inference time, MMA processing time).
- Memory Consumption:
  - Model size (MiniLM-GGUF, MMA model), data representation size, and peak runtime memory usage.
- Ease of Implementation:
  - Qualitative assessment of setup ease, lines of code, required libraries, difficulty of parameter tuning, and documentation quality, summarized per approach (e.g., with a rubric or comparative narrative) considering factors such as:
    - Setup Complexity: e.g., API key acquisition vs. local model download and environment setup vs. a full model-training pipeline in Mathematica.
    - Code Complexity: Estimated lines of core logic; reliance on external vs. standard libraries.
    - Parameter Sensitivity: Number of key hyperparameters requiring tuning and the perceived difficulty of finding good settings.
    - Documentation & Community Support: Availability and clarity of official documentation and community resources (e.g., forums, GitHub issues).
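As referenced under Accuracy of Similarity Scores, here is a small sketch of the ranking and correlation metrics; page identifiers and score arrays are illustrative placeholders.

```python
# Sketch of the ranking metrics from Section 5; MAP is the mean of
# average_precision over all queries.
from scipy.stats import spearmanr

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k ranked pages that are in the relevant set."""
    return sum(1 for page in ranked[:k] if page in relevant) / k

def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Average of precision values at each rank where a relevant page appears."""
    hits, total = 0, 0.0
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def rank_correlation(system_scores, human_scores) -> float:
    """Spearman's rank correlation against human similarity judgments."""
    rho, _p_value = spearmanr(system_scores, human_scores)
    return float(rho)
```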
6. Experimental Procedure (Including Future Experiments)
- Data Preparation:
  - Prepare HTML files of the target document in the `./tailwindcss.com` directory.
  - (For NCD) Extract and preprocess full-page plain text from each HTML file using `html2text` and `sed` (as previously done by the user).
  - (For Vector Embedding) Use the same preprocessed full-page plain text.
  - Create ground truth data for accuracy evaluation as described in Section 5.
- Implementation and Execution of Representation Methods:
  - Naive: Use the preprocessed page text directly.
  - Embedding-Gemini: Use Python's `requests` library or similar to send each page text to the Gemini API (`models/text-embedding-004`) and retrieve/store the vector representations.
  - Embedding-MiniLM-GGUF: Use appropriate libraries (e.g., `ctransformers`, or `sentence-transformers` combined with `llama-cpp-python`) to load the `all-MiniLM-L6-v2-Q5_K_M.gguf` model, then input each page text to extract and store vector representations.
  - Embedding-MMA: In Mathematica, apply an embedding layer to each page text, followed by L2 normalization and PCA dimensionality reduction (to approx. 100 dimensions), then extract and store the vector representations.
- Implementation and Execution of Comparison Methods:
  - NCD: Refer to the user-provided Python scripts (`main.py`, `comparison.py`) to call the various command-line compression tools for NCD calculation, or extend this to use Python's compression libraries directly for better control and efficiency.
  - Cosine Similarity, Euclidean Distance, Manhattan Distance: Implement using standard math libraries (e.g., Python's NumPy, SciPy).
  - Mahalanobis Distance: Implement using `scipy.spatial.distance.mahalanobis`. Requires pre-calculating the covariance matrix (or its inverse) from the full set of vectors for each embedding type.
- Evaluation Execution:
  - Calculate similarity (or distance) scores between the search query and all document pages (and between page pairs for ground truth evaluation) for each experimental case.
  - Measure processing speed and memory consumption.
  - Calculate accuracy metrics using the computed similarity scores and ground truth data.
  - Record and evaluate the ease of implementation.
- Result Aggregation and Analysis:
  - Compile the obtained evaluation metrics into tables and graphs for comparative analysis of each method's characteristics. A condensed end-to-end sketch of one NCD experimental case follows this list.
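Tying the procedure together, here is a condensed, hypothetical end-to-end sketch of one NCD experimental case (query vs. every page, sorted CSV output), reusing the `ncd()` helper from the Section 3.2 sketch. The `.txt` file layout and output filename are assumptions; the user's `main.py`/`comparison.py` remain the reference implementation.

```python
# End-to-end sketch of one experimental case (cf. main.py / comparison.py).
# Assumes ncd() from the Section 3.2 sketch is importable or defined above.
import csv
import pathlib

query = ("Utilities for controlling how a background image behaves "
         "when scrolling.").encode()

rows = []
for page in pathlib.Path("./tailwindcss.com").rglob("*.txt"):
    rows.append((str(page), ncd(query, page.read_bytes(), "xz")))

rows.sort(key=lambda r: r[1])  # lower NCD = more similar
with open("ncd_xz.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["page", "ncd"])
    writer.writerows(rows)
print(rows[0])  # best match for this query and compressor
```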
Experimental Environment (Assumed)
- Hardware: (e.g., CPU: Intel Core i7-10700, Memory: 32GB RAM, GPU: NVIDIA GeForce RTX 3070 8GB - specify if GPU is used for MiniLM or MMA)
- Software: (e.g., OS: Linux (Ubuntu, etc.), Programming Language: Python 3.x (with versions for key libraries like NumPy, SciPy, requests, ctransformers, etc.), Mathematica 13.x, specific versions of command-line compression tools if used directly)
7. Expected Outcomes and Future Outlook
This research (including initial NCD experiments and future vector embedding experiments) is expected to yield the following outcomes:
- Clarification of the Impact of Compression Algorithms on NCD: As indicated by initial experiments, the choice of compression algorithm significantly affects similarity judgments. Further validation with more diverse data and queries will allow for a deeper understanding of each algorithm's characteristics.
- Performance Characteristics of Various Methods on Full-Page Text: To clarify how NCD and the various vector embedding methods perform in terms of accuracy, speed, and resource consumption when applied to entire page texts.
- Comparison of Local and API-Based Models: In the future vector embedding experiments, to compare the performance, speed, and resource efficiency of Embedding-MiniLM-GGUF (local, quantized) and Embedding-Gemini (API, large-scale) to identify practical trade-offs.
- Evaluation of Domain-Specific Embedding Effectiveness: To assess how well Embedding-MMA, processed or tuned for a single technical document set, performs compared to general-purpose models.
- Provision of Practical Insights: To offer guidelines for selecting appropriate similarity comparison approaches based on text characteristics (e.g., Markdown/code content) and system requirements (e.g., ease of preprocessing, emphasis on accuracy vs. speed).
8. Future Challenges
- Ensuring Quality of Ground Truth Data: Evaluating full-page similarity can be more subjective than chunk-level evaluation, making the creation of high-quality ground truth data challenging. Establishing clear annotation guidelines and measuring inter-annotator agreement will be crucial.
- Hyperparameter Optimization: Many methods involve tunable parameters (e.g., Embedding-MMA model structure, PCA dimensionality, MiniLM-GGUF inference parameters), the optimization of which may be beyond the scope of this initial study. The impact of default vs. tuned parameters could be noted.
- Noise in Full-Page Comparison: Full-page texts may contain common navigational elements or boilerplate text that could act as noise in similarity judgments. Strategies to mitigate this (e.g., more advanced text extraction, or methods robust to such noise) could be a future research direction.
- Input Length Limitations of Vector Embedding Models: Very long page texts might exceed the input length limits of some vector embedding models, requiring handling strategies such as the following (a mean-pooling sketch follows this list):
  - Truncation: Use only the initial N tokens of each page; simple, but crucial information may be lost.
  - Summarization: Employ an abstractive or extractive summarization model to create a condensed version of the page, which could preserve key information but adds another layer of processing and potential information loss or bias.
  - Chunking and Averaging/Pooling: Divide long pages into manageable chunks, embed each chunk, and aggregate the chunk embeddings (e.g., by averaging) into a single page vector. How chunks are defined and aggregated needs careful consideration.
  - Utilizing Long-Context Models: Where available and feasible, leverage embedding models specifically designed for longer sequences.

  Whichever strategy is chosen will be documented, and its potential impact on results acknowledged.
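For the chunking-and-pooling option above, here is a minimal sketch assuming a naive whitespace chunker and any `embed(text) -> vector` function (such as the hypothetical `embed_minilm` from Section 3.1's sketch); a real setup would chunk by model tokens rather than words.

```python
# Sketch of chunking plus mean-pooling for pages that exceed a model's
# input limit; chunk size and the whitespace tokenizer are assumptions.
import numpy as np

def embed_long_page(text: str, embed, chunk_tokens: int = 256) -> np.ndarray:
    """Split a long page into fixed-size chunks, embed each, mean-pool."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_tokens])
              for i in range(0, len(words), chunk_tokens)] or [""]
    vectors = np.vstack([np.asarray(embed(chunk)) for chunk in chunks])
    pooled = vectors.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm else pooled  # re-normalize after pooling
```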