
Shen Qianli developed data attribution and relevance filtering capabilities for the Data-Juicer repository, aimed at improving data quality for downstream machine learning tasks. Shen designed and implemented five new operators: in-context influence, instruction-following difficulty, LLM perplexity, task relevance, and text embedding similarity filters. Together they let the system assess both linguistic and task-specific data relevance. The work was done in Python and YAML and drew on data filtering, LLM integration, and machine learning operations. The update strengthened data quality signals, enabling more accurate attribution and relevance assessment, with no major bugs documented.
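To make the idea behind a perplexity-based filter concrete, here is a minimal sketch in plain Python. It is not Data-Juicer's implementation: the function names `perplexity` and `keep_sample`, the `token_logprobs` field, and the threshold values are all hypothetical, and the per-token log-probabilities are assumed to have already been produced by some language model.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_sample(sample: Dict, min_ppl: float = 1.0, max_ppl: float = 100.0) -> bool:
    """Keep a sample whose perplexity lies inside [min_ppl, max_ppl].

    The thresholds are illustrative assumptions, not library defaults.
    """
    return min_ppl <= perplexity(sample["token_logprobs"]) <= max_ppl


# Hypothetical samples with natural-log probabilities per token.
samples = [
    {"text": "fluent sentence", "token_logprobs": [-0.5, -0.7, -0.6]},
    {"text": "garbled tokens", "token_logprobs": [-6.0, -7.5, -8.0]},
]

# Only samples the model finds reasonably predictable survive the filter.
kept = [s["text"] for s in samples if keep_sample(s)]
print(kept)
```

In this sketch a low perplexity means the model finds the text predictable, so a sample with very unlikely tokens falls outside the range and is dropped. A real operator would also need to run the model to obtain the log-probabilities.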

July 2025 Monthly Summary: Delivered advanced data attribution and relevance filtering in Data-Juicer, introducing new operators for data analysis and refinement. Implemented filters: in_context_influence_filter, instruction_following_difficulty_filter, llm_perplexity_filter, llm_task_relevance_filter, and text_embd_similarity_filter, which improve linguistic and task-specific relevance assessment. Major bugs fixed: none documented this month. Impact: strengthened data quality signals for downstream model training and evaluation, enabling more accurate attribution and relevance assessment. Technologies/skills demonstrated: operator design, data attribution, relevance filtering, ML data tooling, Python, commit-driven development. Commit reference: 950caf1f6b71782b842a4f38605cc474804ffcd2 in repo modelscope/data-juicer.